PREFACE


This volume is the Proceedings of the 1983 International Conference on Parallel Process- 
ing, the twelfth in a series of annual meetings. This year’s conference represents the largest 
yet, both in number of papers submitted and number of papers presented at the conference. 
The previous records were 136 papers submitted (1981) and 67 papers presented (1982). This 
year 240 papers were submitted! The final program contains 97 papers from twelve countries. 
Academia, industry, and government research labs are all represented. 

Because of the large number of excellent papers submitted, the task of arriving at a pro- 
gram was an extremely difficult one. For the first time, the Parallel Processing Conference 
will have parallel sessions in order to accommodate more papers. Even at that, there were 
many very good papers which it was not possible to include, and there were many papers sub- 
mitted as regular papers which were accepted as short papers. On the positive side, because 
of the large number of submissions, all of the papers finally accepted and included in this 
proceedings are of the highest quality. We sincerely thank all of the authors who submitted
papers for their interest in the conference. 

Special thanks go to the 379 referees who read and evaluated the manuscripts. Each sub- 
mitted paper was sent to three referees. Without the efforts of these reviewers, the task of 
arriving at a program would have been virtually impossible. The names of the referees are 
listed in these proceedings. 

We wish to thank Dr. Ben Coates, Head of the Electrical Engineering School at Purdue, 
for his support and encouragement. We thank Dee Dee Dexter, Carol Edmundson, and Jenny 
Hiatt for their excellent job in managing the massive amounts of paperwork involved in han- 
dling 240 submissions, 720 reviews, and 97 accepted papers. We also want to thank Wanda 
Booth, Andy Hughes, Sharon Katz, Mickey Krebs, Nancy Lein, Pat Loomis, Vicky Spence, 
and Linda Stovall for their assistance. Ken Batcher provided valuable information about the 
running of the 1982 conference. We thank Ming T. Liu and Chuan-Lin Wu for handling the 
reviewing of the papers submitted from Purdue. 

Finally, we wish to acknowledge the efforts of Tse-Yun Feng. As always, he has been a 
one-man conference organizing committee, handling publicity, local arrangements, budget, and 
all other facets of the conference. We thank him for giving us the opportunity to chair this
year’s technical program.


H. J. Siegel 
Leah Siegel 
Program Co-Chairs 


Purdue University 
June 1983 
ABSTRACT


An interference analysis of the Interconnection
Networks (INs) for a tightly coupled multiprocessor
is presented in this paper. The interconnections
considered are crossbars and delta
networks. Two situations are examined: when a 
memory module is equally likely to be addressed 
by a processor and when a processor has a favor- 
ite memory. It is shown that for a higher rate 
of favorite requests, the delta networks perform 
close to a crossbar. 


INTRODUCTION 


A multiprocessor architecture can be broadly 
divided into two categories: loosely coupled and 
tightly coupled. In a loosely coupled multi- 
processor, each processor has a local memory and 
the communication between the Processing Elements 
(PEs) is accomplished through an Interconnection 
Network (IN). A PE essentially consists of a 
processor and its local memory. In a tightly 
coupled system, the processors are connected to 
one side of the IN and the memory modules are 
connected to the other side. The IN is capable 
of connecting a processor to any one of the 
memory modules. The loosely coupled and tightly 
coupled architectures are illustrated in Fig. 1.
In this paper, we consider an interference analysis
of the INs for a tightly coupled multiprocessor.


A crossbar interconnection [1] allows all 
possible one-to-one and simultaneous connections 
between the processors and the memory modules. 
When two or more processors try to access the 
same memory, only one of them will be connected 
and the rest will be blocked or rejected. Bandwidth
(BW) is defined as the expected number of
memory requests accepted per cycle, or equivalently
the average number of memory modules remaining busy
in a cycle. This parameter measures how efficiently
an IN serves the requests offered to it. The
interference analysis of an MxN crossbar for M 
processors and N memory modules, when a processor 
is equally likely to address any one of the N 
common memories, is well known [2-4]. However, 
in a practical situation, a processor is likely 
to address a particular memory most of the time 
except when an interprocessor communication is 
necessary. If processor $i$ ($P_i$) communicates
more often with memory module $i$ ($MM_i$), we will
call $MM_i$ a favorite memory of $P_i$ and $P_i$
a favorite processor of $MM_i$. We will assume
that we have prior knowledge of a factor $m$,
which is the probability that $P_i$ addresses $MM_i$
given that $P_i$ generates a request. When
$m = 1/N$, a processor is equally likely to address
any one of the $N$ memory modules and the favorite
case reduces to the equally likely case. In
this paper, we carry out an analysis of the
favorite memory case for an MxN crossbar switch
when $M = N$, $M \geq N$ and $M \leq N$.


Because of the $O(N^2)$ switch complexity of
an NxN crossbar, Multistage Interconnection
Networks (MINs) have been proposed recently for
large values of $N$. Several MINs such as Omega
[5], Indirect binary n-cube [6], Generalized cube
[7] and Baseline [8] are known. An NxN MIN
basically employs $\log_2 N$ stages of 2x2 switches
with $N/2$ switches per stage. It is
capable of performing a subset of one-to-one and
simultaneous mappings while reducing the cost to
$O(N \log_2 N)$. The mappings or permutations
achieved by one network may differ from those of
another, depending on the interconnection used
between the stages. However, these MINs are all
functionally equivalent in terms of their BWs and
the total number of permutations achieved.
Interference analyses of such MINs have been
reported in a few papers [4,9-11] for equally
likely cases. The VLSI performance of these
networks has also been studied [12,13] for the case
where the whole network is fabricated on a single chip. In
terms of area*delay characteristics, the MINs do
not compare as well against crossbars in VLSI
as they do in an SSI implementation.
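The cost argument above can be made concrete with a small sketch. It counts crosspoints for an NxN crossbar and for an NxN MIN built from 2x2 switches. The function names are illustrative, and charging four crosspoints per 2x2 switch is an assumption made here only to put both designs on the same scale.

```python
def crossbar_crosspoints(n: int) -> int:
    """Crosspoints in an n x n crossbar, which grows as O(N^2)."""
    return n * n


def min_stage_count(n: int) -> int:
    """Number of stages, log2(n), in a MIN of 2x2 switches (n a power of two)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    return n.bit_length() - 1


def min_switch_count(n: int) -> int:
    """Total 2x2 switches: log2(n) stages with n/2 switches per stage."""
    return min_stage_count(n) * (n // 2)


def min_crosspoints(n: int) -> int:
    """Crosspoints in the MIN, assuming 4 crosspoints per 2x2 switch."""
    return 4 * min_switch_count(n)


if __name__ == "__main__":
    for n in (16, 256, 1024):
        print(n, crossbar_crosspoints(n), min_crosspoints(n))
```

Under this counting, the gap between the two designs widens quickly as N grows, which is the motivation the text gives for MINs at large N.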


A Delta network [4] is a self-routing interconnection
network that connects $M = a^n$ inputs
to $N = b^n$ outputs through $n$ stages of $a \times b$
crossbar switches. All the MINs form a class of
Delta networks with $a = b = 2$. A still broader
class of networks called Radix Shuffle Networks
(RSNs) was introduced recently [11] for connecting
$M$ processors to $N$ memory modules for arbitrary
values of $M$ and $N$. If $M$ and $N$ can be
factored into $r$ components as
$M = m_1 \times m_2 \times \cdots \times m_r$ and
$N = n_1 \times n_2 \times \cdots \times n_r$, an RSN consists
of $r$ stages of switches with the $i$th
stage employing $m_i \times n_i$ crossbar modules.
Delta is a special case of the RSN when all $m_i$'s
are equal to $a$ and all $n_i$'s are equal to $b$.
All the above cited networks form a part of the
Banyan networks [14], introduced for partitioning
multiprocessor systems. Interference analysis of
the RSNs was also reported [11] when a processor
is equally likely to address a memory module. We
carry out here an analysis for the RSNs for the
favorite memory case. The results derived for a
crossbar are successfully applied to the RSNs.
Because of the complexity involved, we restrict
our analysis to NxN Delta networks only. The
theoretical results match those obtained
from simulations.


ANALYSIS OF CROSSBAR 


A crossbar is capable of connecting M pro- 
cessors to N memory modules for any arbitrary 
values of M and N [1]. The analyses given here 
are based on the following assumptions. 


1. The crossbar operates in a synchronous mode 
i.e. the requests issued by the processors 
begin and end simultaneously. 

2. The requests are random and the request 
generated by a processor is independent of 
the request generated by another processor. 

3. Requests which are not accepted are blocked
or rejected.

4. The requests generated in a cycle are inde- 
pendent of the requests generated in the 
previous cycle. 

5. $p_0$ is the probability with which a processor
generates a request. Thus $p_0$ is the rate
of request of a processor per cycle.

6. $m$ is the probability with which processor
$P_i$ addresses memory $MM_i$ given that $P_i$
generates a request. Thus $m p_0$ is the rate
of request of a processor directed to its
favorite memory.


In an MIMD [15] operation, the memory 
requests are asynchronous. Various simulations 
[3,4] indicate that assumption 1 does not bring 
in a substantial difference in the results. When 
asynchronous operation is assumed, buffers should 
be provided in the switches [9]. Assumption 4 is 
unrealistic because the requests rejected in a 
cycle will indeed be resubmitted in the next 
cycle. This assumption leads to amazingly simple
closed-form equations for a crossbar [3] and
produces negligible discrepancies in the result
[2]. Assumption 5 indicates that a processor
need not send a request in every cycle. Assumption
6 considers memory module $MM_i$ as a favorite
memory of the processor $P_i$. In an MxN crossbar
for an equally likely case, a processor addresses
a memory with a probability of $1/N$. $MM_i$ is
considered a favorite memory of $P_i$ only if $m > 1/N$.
The values of $p_0$ and $m$ are program
dependent and can be determined. With a probability
$p_0$ of a processor generating a request,
the probability $q(j)$ that $j$ requests are
generated by $M$ processors is given by:


$$ q(j) = \binom{M}{j} p_0^{j} (1 - p_0)^{M-j} $$


where $\binom{M}{j}$ is the binomial coefficient.


(a) Equally likely case for MxN crossbar 


This is a situation where a processor is
equally likely to address any one of the $N$
memory modules. The probability that a processor
addresses a particular memory is $1/N$, given that
the processor generates a request. The probability
that a memory module is addressed by $k$ processors,
given that $j$ requests are generated by the
processors, is the binomial term
$\binom{j}{k} (1/N)^{k} (1 - 1/N)^{j-k}$.
The subscript 'e' stands for the equally likely case.


For various values of $k$ ranging from 1 to $j$,
the rate of request at a memory module is


$$ p_e(j) = \sum_{1 \le k \le j} \binom{j}{k} \left(\frac{1}{N}\right)^{k} \left(1 - \frac{1}{N}\right)^{j-k} = 1 - \left(1 - \frac{1}{N}\right)^{j} $$


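With each processor having a probability $p_0$ of
generating a request, the total rate of request
at a memory module is given by


$$ p_e = \sum_{0 \le j \le M} q(j)\, p_e(j) = \sum_{0 \le j \le M} \binom{M}{j} p_0^{j} (1 - p_0)^{M-j} \left\{ 1 - \left(1 - \frac{1}{N}\right)^{j} \right\} = 1 - \left(1 - \frac{p_0}{N}\right)^{M} \tag{1} $$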
BW is the average number of memory modules
remaining busy in a cycle.

In other words, BW = rate of request at a memory
module * the number of memory modules.

Hence, for the equally likely case,


$$ BW_e = N \left\{ 1 - \left(1 - \frac{p_0}{N}\right)^{M} \right\} \tag{2} $$
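As a numerical check on the equally likely case, the sketch below evaluates eqn. (2) and also recomputes the per-module request rate by the explicit binomial summation that leads to eqn. (1). The function names are illustrative and not from the paper.

```python
from math import comb


def rate_by_summation(M: int, N: int, p0: float) -> float:
    """Per-module request rate as the sum over j of q(j) * p_e(j)."""
    total = 0.0
    for j in range(M + 1):
        q_j = comb(M, j) * p0**j * (1.0 - p0) ** (M - j)  # j requests issued
        p_e_j = 1.0 - (1.0 - 1.0 / N) ** j                # module hit at least once
        total += q_j * p_e_j
    return total


def rate_closed_form(M: int, N: int, p0: float) -> float:
    """Closed form of eqn. (1): 1 - (1 - p0/N)^M."""
    return 1.0 - (1.0 - p0 / N) ** M


def bandwidth_equally_likely(M: int, N: int, p0: float) -> float:
    """Eqn. (2): BW_e = N * (1 - (1 - p0/N)^M)."""
    return N * rate_closed_form(M, N, p0)


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, round(bandwidth_equally_likely(n, n, 1.0), 4))
```

The summation and the closed form agree to rounding error, which is the binomial-theorem step hidden between the two forms of eqn. (1).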


(b) Favorite memory case for NxN crossbar 


Let $m$ be the probability that processor $P_i$
requests memory module $MM_i$ given that $P_i$
generates a request. Hence, the probability of
$P_i$ requesting $MM_i$ is $P_i \rightarrow MM_i = p_0 m$.
The probability that $P_i$ does not request $MM_i$ is
$P_i \not\rightarrow MM_i = 1 - p_0 m$.


Given another processor $P_j$, for $i \neq j$,
which generates a request, $P_j \rightarrow MM_i = \frac{1-m}{N-1} = x$, say,
and $P_j \not\rightarrow MM_i = 1 - x$.


In a situation when there are a total of $j$
requests of which $k$ requests arrive at $MM_i$,
two distinct possibilities can occur:


$P_i \rightarrow MM_i$ and $(k-1)$ other processors $\rightarrow MM_i$


or $P_i \not\rightarrow MM_i$ and $k$ other processors $\rightarrow MM_i$.


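The rate of request at $MM_i$ given $j$ requests
at the input side is


$$ p_f(j) = \sum_{1 \le k \le j} \left\{ p_0 m \binom{j-1}{k-1} x^{k-1} (1-x)^{j-k} + (1 - p_0 m) \binom{j-1}{k} x^{k} (1-x)^{j-1-k} \right\} = 1 - (1 - p_0 m)(1 - x)^{j-1} \tag{3} $$


The subscript 'f' stands for the favorite
memory case.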


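With a probability $p_0$ of $P_i$ generating a
request and with $(N-1)$ other processors besides
$P_i$, the total rate of request at $MM_i$ is


$$ p_f = \sum_{0 \le j-1 \le N-1} \binom{N-1}{j-1} p_0^{j-1} (1 - p_0)^{N-j}\, p_f(j) = 1 - (1 - p_0 m)(1 - p_0 x)^{N-1} $$


With $x = \frac{1-m}{N-1}$,

$$ p_f = 1 - (1 - p_0 m) \left(1 - p_0 \frac{1-m}{N-1}\right)^{N-1} \tag{4} $$


and

$$ BW_f = N \left\{ 1 - (1 - p_0 m) \left(1 - p_0 \frac{1-m}{N-1}\right)^{N-1} \right\} \tag{5} $$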


$\lim_{m \to 1} BW_f = p_0 N$, which means that if all the
requests were favorite, the BW is equal to the
number of requests generated; so all the requests
are accepted. Equally likely is a special case
of the favorite memory case with $m = 1/N$.
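The two limiting statements above are easy to confirm numerically. The following sketch evaluates eqn. (5) directly; the function name is illustrative and the parameter values in the demo are arbitrary.

```python
def bandwidth_favorite_nxn(N: int, p0: float, m: float) -> float:
    """Eqn. (5): BW_f of an NxN crossbar with favorite probability m."""
    if N < 2:
        raise ValueError("N must be at least 2")
    x = (1.0 - m) / (N - 1)  # chance that a requesting P_j picks a given non-favorite module
    idle = (1.0 - p0 * m) * (1.0 - p0 * x) ** (N - 1)  # probability that MM_i gets no request
    return N * (1.0 - idle)


if __name__ == "__main__":
    N, p0 = 16, 0.7
    for m in (1.0 / N, 0.5, 0.8, 1.0):
        print(round(m, 4), round(bandwidth_favorite_nxn(N, p0, m), 4))
```

With $m = 1/N$ the value coincides with eqn. (2) for $M = N$, and with $m = 1$ it equals $p_0 N$.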
The BW of an NxN crossbar is plotted in
Fig. 2 both for the favorite memory case with $m = 0.8$
and for the equally likely case. With a favorite
memory, a processor remains busy with its favorite
memory module most of the time. As a result,
fewer conflicts occur, which in turn gives rise to a
higher BW. A favorite case for NxN crossbars can
also be visualized as shown in Fig. 3a. The rate of
request at $MM_i$ due to $P_i$ is $p_A = p_0 m$. The rate of
request at $MM_i$ due to the $(N-1)$ other processors is

$$ p_B = 1 - \left(1 - p_0 \frac{1-m}{N-1}\right)^{N-1}, $$

similar to eqn. (1) for $(N-1)$ processors and $(N-1)$
non-favorite memories. Because of assumption 2, these two
rates are statistically independent of each
other. Hence, the total rate of request at $MM_i$ is


$$ p_f = p_A + p_B - p_A p_B = p_0 m + (1 - p_0 m) \left\{ 1 - \left(1 - p_0 \frac{1-m}{N-1}\right)^{N-1} \right\} = 1 - (1 - p_0 m) \left(1 - p_0 \frac{1-m}{N-1}\right)^{N-1}, $$


which is $p_f$ in eqn. (4).
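The paper states that the analytical results agree with simulation. A minimal Monte Carlo sketch of the NxN favorite memory crossbar, written under assumptions 1 to 6 above, is given below for comparison with eqn. (5). The simulator is an illustration written for this purpose and is not the authors' simulation program; the function names, seed and cycle count are arbitrary.

```python
import random


def simulate_crossbar_bw(N: int, p0: float, m: float, cycles: int, seed: int = 0) -> float:
    """Average number of busy memory modules per cycle in an NxN crossbar."""
    if N < 2:
        raise ValueError("N must be at least 2")
    rng = random.Random(seed)
    busy_total = 0
    for _ in range(cycles):
        requested = set()
        for i in range(N):
            if rng.random() >= p0:
                continue  # processor i issues no request this cycle
            if rng.random() < m:
                target = i  # favorite memory MM_i
            else:
                # uniform choice among the N-1 non-favorite modules
                target = rng.randrange(N - 1)
                if target >= i:
                    target += 1
            requested.add(target)
        # each requested module accepts exactly one request; the rest are rejected
        busy_total += len(requested)
    return busy_total / cycles


def analytic_bw(N: int, p0: float, m: float) -> float:
    """Eqn. (5)."""
    x = (1.0 - m) / (N - 1)
    return N * (1.0 - (1.0 - p0 * m) * (1.0 - p0 * x) ** (N - 1))


if __name__ == "__main__":
    print(simulate_crossbar_bw(8, 1.0, 0.8, 5000), analytic_bw(8, 1.0, 0.8))
```

Because rejected requests are discarded rather than resubmitted (assumption 4), the simulated average should approach eqn. (5) as the number of cycles grows.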


(c) Favorite memory case for MxN crossbars with
M>N


The situation is depicted in Fig. 3b. The
processors are divided into two groups. Group A
consists of $N$ processors having favorite
memories and group B consists of $M-N$ processors
that are equally likely to address any one of the
memory modules with a probability of $1/N$. The
rate of request at $MM_i$ due to the processors
belonging to group A is

$$ p_A = 1 - (1 - p_0 m) \left(1 - p_0 \frac{1-m}{N-1}\right)^{N-1}, $$

the same as in eqn. (4).


The rate of request at $MM_i$ due to the processors
belonging to group B is

$$ p_B = 1 - \left(1 - \frac{p_0}{N}\right)^{M-N}, $$

from eqn. (1).


Since the request rates are statistically
independent, the overall request rate at $MM_i$ is


$$ p_f = p_A + p_B - p_A p_B = 1 - (1 - p_0 m) \left(1 - p_0 \frac{1-m}{N-1}\right)^{N-1} \left(1 - \frac{p_0}{N}\right)^{M-N} \tag{6} $$

$$ BW_f = N \left\{ 1 - (1 - p_0 m) \left(1 - p_0 \frac{1-m}{N-1}\right)^{N-1} \left(1 - \frac{p_0}{N}\right)^{M-N} \right\} \tag{7} $$


When $M = N$, eqn. (7) reduces to eqn. (5).
When $m = 1/N$, $BW_f = N \left\{ 1 - \left(1 - \frac{p_0}{N}\right)^{M} \right\}$, which
is the same as the equally likely case.
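The two reductions stated for eqn. (7) can be verified with a short sketch. The function name is illustrative, and the values of M, N and $p_0$ used in the demo are arbitrary.

```python
def bandwidth_favorite_m_ge_n(M: int, N: int, p0: float, m: float) -> float:
    """Eqn. (7): BW_f of an MxN crossbar with M >= N."""
    if N < 2 or M < N:
        raise ValueError("requires M >= N >= 2")
    x = (1.0 - m) / (N - 1)
    idle_from_group_a = (1.0 - p0 * m) * (1.0 - p0 * x) ** (N - 1)  # N favorite processors
    idle_from_group_b = (1.0 - p0 / N) ** (M - N)                   # M-N equally likely processors
    return N * (1.0 - idle_from_group_a * idle_from_group_b)


if __name__ == "__main__":
    for M in (16, 24, 32):
        print(M, round(bandwidth_favorite_m_ge_n(M, 16, 0.5, 0.8), 4))
```

Writing the idle probability as a product of the two groups mirrors the independence argument used to obtain eqn. (6).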


(d) Favorite memory case for MxN crossbars with 
N>M 


The situation is depicted in Fig. 3c. The
memory modules are divided into two groups A
and B. Group A consists of $M$ favorite memories
and group B consists of $(N-M)$ memories that are
equally addressed by a processor with a
probability $x = \frac{1-m}{N-1}$, given that the
processor generates a request.


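Given that there are $j$ requests generated
by the processors including processor $P_i$, the
rate of request at $MM_i$ belonging to group A is
$p_A(j) = 1 - (1 - p_0 m)(1 - x)^{j-1}$, from eqn. (3).
With the processors having a probability of
request $p_0$, the total rate of request at $MM_i$
belonging to group A is

$$ p_A = \sum_{0 \le j-1 \le M-1} \binom{M-1}{j-1} p_0^{j-1} (1 - p_0)^{M-j}\, p_A(j) = 1 - (1 - p_0 m)(1 - p_0 x)^{M-1}. $$

With $x = \frac{1-m}{N-1}$,

$$ p_A = 1 - (1 - p_0 m) \left(1 - p_0 \frac{1-m}{N-1}\right)^{M-1}. $$

A processor addresses a memory module belonging
to group B with a probability of $\frac{1-m}{N-1}$. The
probability of generation of a request being $p_0$, the
rate of request at a memory module belonging to group B is


$$ p_B = 1 - \left(1 - p_0 \frac{1-m}{N-1}\right)^{M}, $$

from eqn. (1). Then

$$ BW_f = p_A \cdot M + p_B \cdot (N-M) $$

$$ = M \left\{ 1 - (1 - p_0 m) \left(1 - p_0 \frac{1-m}{N-1}\right)^{M-1} \right\} + (N-M) \left\{ 1 - \left(1 - p_0 \frac{1-m}{N-1}\right)^{M} \right\} $$

$$ = N - M (1 - p_0 m) \left(1 - p_0 \frac{1-m}{N-1}\right)^{M-1} - (N-M) \left(1 - p_0 \frac{1-m}{N-1}\right)^{M} \tag{8} $$

Again with $m = 1/N$, $BW_f = N - N \left(1 - \frac{p_0}{N}\right)^{M} = BW_e$.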
$BW_f$ obtained with $M = 16$ is plotted in Fig. 4
for various values of $N$. Compared to the BW for the
equally likely case ($BW_e$), there is a fast
increase in $BW_f$ with increase in $N$ for $N \leq 16$.
The rate of increase in BW in Fig. 5, for
$N$ fixed at 16, is also similar to that obtained
in Fig. 4. The difference between $BW_f$ and $BW_e$
is maximum when $M$ is equal to $N$. This is
reasonable because the maximum possible BW is
limited to $\min\{M, N\}$ irrespective of whether the
case is favorite or not.
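Eqn. (8) for the case $N > M$ can be checked in the same way as the earlier cases. The sketch below is illustrative only; the function name is an assumption, and the demo fixes $M = 16$ as in the Fig. 4 discussion while the other values are arbitrary.

```python
def bandwidth_favorite_m_le_n(M: int, N: int, p0: float, m: float) -> float:
    """Eqn. (8): BW_f of an MxN crossbar with M <= N."""
    if N < 2 or not 1 <= M <= N:
        raise ValueError("requires 1 <= M <= N and N >= 2")
    x = (1.0 - m) / (N - 1)
    p_a = 1.0 - (1.0 - p0 * m) * (1.0 - p0 * x) ** (M - 1)  # a favorite (group A) module
    p_b = 1.0 - (1.0 - p0 * x) ** M                          # a non-favorite (group B) module
    return M * p_a + (N - M) * p_b


if __name__ == "__main__":
    for N in (16, 24, 32, 64):
        print(N, round(bandwidth_favorite_m_le_n(16, N, 1.0, 0.8), 4))
```

The result never exceeds $M$, in line with the remark that the bandwidth is bounded by $\min\{M, N\}$.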


ANALYSIS OF DELTA NETWORKS 


A delta network [4] is a multistage interconnection
network that connects $M = a^n$ inputs
to $N = b^n$ outputs through $n$ stages of $a \times b$
crossbar modules. The $i$th stage of the delta
network consists of $\frac{M}{a} \left(\frac{b}{a}\right)^{i-1}$
$a \times b$ crossbar switches and produces
$M \left(\frac{b}{a}\right)^{i}$ outputs.


An interference analysis of these networks for
the equally likely case is presented in [4]. The
analysis is based on a recursive computation of
the rate of request at a stage. The rate of
request on an output line of the $i$th stage is:


$$ p_i = 1 - \left(1 - \frac{p_{i-1}}{b}\right)^{a} \tag{9} $$

where $p_0$ is the probability of generation of a
request by a processor. With a given value of
$p_0$, the final rate of request at an output line
of the delta network can be computed using the
above recursive equation. Then $BW = N \cdot p_n$.
We develop here such a recursive analysis for
delta networks as applicable to the favorite
memory case. Because of the complexity involved,
we restrict our analysis to NxN delta networks
only. An NxN delta network with $N = a^n$ consists
of $n$ stages of $a \times a$ crossbar modules with $N/a$
such modules per stage. The interconnection
between the stages is an a-shuffle of the inputs.


$S_a(j)$, the a-shuffle of an integer $j$, is given by


$$ S_a(j) = \begin{cases} a\,j \bmod (N-1) & \text{for } 0 \le j < N-1 \\ j & \text{for } j = N-1 \end{cases} \tag{10} $$
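Eqn. (10) translates directly into code. The sketch below is an illustration with a hypothetical function name; it assumes the number of lines is a power of $a$, as in the NxN delta networks considered here.

```python
def a_shuffle(j: int, a: int, n_lines: int) -> int:
    """a-shuffle of line j among n_lines lines, following eqn. (10)."""
    if not 0 <= j < n_lines:
        raise ValueError("j must satisfy 0 <= j < n_lines")
    if j == n_lines - 1:
        return j  # the last line maps to itself
    return (a * j) % (n_lines - 1)


if __name__ == "__main__":
    # 2-shuffle (perfect shuffle) of 8 lines, as used in an 8x8 omega network
    print([a_shuffle(j, 2, 8) for j in range(8)])
```

For $a = 2$ and 8 lines the mapping is the familiar perfect shuffle, and in general it is a permutation of the line numbers.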


Omega network [5] is a special case of delta
network with $a = 2$. We define that a processor
is connected to its favorite module when all the
switches are in straight connection as shown in
Fig. 6 for an 8x8 omega network. When all the
switches are connected straight in a delta network
with a-shuffle interconnection before each
stage, an identity permutation results. Hence,
$MM_i$ is a favorite memory of $P_i$ for $0 \le i \le N-1$.
The analyses presented below also hold for delta
networks which do not employ an a-shuffle
interconnection before each stage. In such networks,
the straight connections of the switches may not
result in an identity mapping and hence, the
favorite memories will be different without any
change in the actual performance. In addition to
the assumptions spelled out for the crossbar
analysis, we make an important additional assumption
for delta networks. Whenever a number of
requests reach an output line of a switch, one
request is accepted at random, each with equal
probability. We will call this the Equal Acceptance
(EA) rule.


Consider two switches A and B from two
adjacent stages of the delta network as shown in
Fig. 7. There can be only one connection from an
output of switch A to the input of switch B. The
other output lines of switch A will be connected
to $(a-1)$ other switches of the $(i+1)$th stage.
The number of output lines being the same at each
stage, the rate of request remains the same for all
the output lines of a particular stage. However,
it may vary from stage to stage. Let $p_i$ be the
rate of request on an output line of the $i$th
stage of switches. Clearly, $p_0$ is the input
rate of request which is equal to the probability
of a processor generating a request. Let $m_i$ be
the probability that there is a favorite request
on an input line to the $(i+1)$th stage, given that
there is a request on that line. Let $m'_i$ be the
fraction of $p_i$ available due to a favorite
request at the input of switch A. So $\bar{m}'_i$, the
fraction of $p_i$ that comes from other inputs of
switch A, is $(1 - m'_i)$.


From eqn. (4),

$$ p_i = 1 - (1 - p_{i-1} m_{i-1}) \left(1 - p_{i-1} \frac{1 - m_{i-1}}{a-1}\right)^{a-1} \quad \text{for } 1 \le i \le n \tag{11} $$


The rate of request at the output of switch A
due to a favorite request is $m'_i p_i$. Given $k$
requests at an output line of switch A, a request
is accepted with a probability of $1/k$ because of the
EA assumption. All other requests are rejected.
In addition to a favorite request, $(k-1)$ other
requests arrive at the output line of switch A.
If there are a total of $j$ requests at the input
of switch A, the rate of request at an output
line due to a favorite request is:


$$ \frac{1}{k}\,(p_{i-1} m_{i-1}) \binom{j-1}{k-1} x_{i-1}^{k-1} (1 - x_{i-1})^{j-k}, \quad \text{where } x_{i-1} = \frac{1 - m_{i-1}}{a-1}. $$
For $k$ varying from 1 to $j$, the rate of
favorite requests at an output line of switch A
is:


$$ \sum_{1 \le k \le j} \frac{p_{i-1} m_{i-1}}{k} \binom{j-1}{k-1} x_{i-1}^{k-1} (1 - x_{i-1})^{j-k} = \frac{p_{i-1} m_{i-1}}{j\, x_{i-1}} \left\{ 1 - (1 - x_{i-1})^{j} \right\} $$
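With $p_{i-1}$ being the probability of request
generation for $(j-1)$ input lines for an $a \times a$
crossbar at the $i$th stage,


$$ p_i m'_i = \sum_{0 \le j-1 \le a-1} \binom{a-1}{j-1} p_{i-1}^{j-1} (1 - p_{i-1})^{a-j}\, \frac{p_{i-1} m_{i-1}}{j\, x_{i-1}} \left\{ 1 - (1 - x_{i-1})^{j} \right\} $$


With $x_{i-1} = \frac{1 - m_{i-1}}{a-1}$,


$$ p_i m'_i = \frac{m_{i-1} (a-1)}{a (1 - m_{i-1})} \left\{ 1 - \left(1 - p_{i-1} \frac{1 - m_{i-1}}{a-1}\right)^{a} \right\} \tag{12} $$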

Using L'Hospital's rule it can be shown that
$\lim_{m_{i-1} \to 1} p_i m'_i = p_{i-1}$. This means all the
requests are accepted if they were favorite. A
closer look at the operation of delta networks
(Fig. 7) reveals that $m_{i-1}$ consists of two types
of favorite requests to switch A. Let $m(f)_{i-1}$
be the fraction of $m_{i-1}$ that consists of
requests to memory module $MM_i$. $m(nf)_{i-1}$ is
the fraction of $m_{i-1}$ directed towards other
memory modules, but appears as a favorite request
to switch A. Assuming the requests due to
$m(f)_{i-1}$ and $m(nf)_{i-1}$ share $m'_i$, at the output of
switch A, as per their proportion, the fraction
of $m'_i$ due to $m(f)_{i-1}$ is $\frac{m(f)_{i-1}}{m_{i-1}} \cdot m'_i$ and the
fraction of $m'_i$ due to $m(nf)_{i-1}$ is $\frac{m(nf)_{i-1}}{m_{i-1}} \cdot m'_i$.


Out of the non-favorite part of $m'_i$, a
fraction of requests will be directed towards
$(a-1)$ other outputs of switch B (Fig. 7) and
the rest will appear as a favorite request.
In a delta network, a fraction $\frac{N/a^{i+1} - 1}{N/a^{i} - 1}$
of this non-favorite rate of request appears as a
favorite request to a switch at the $(i+1)$th stage
for $1 \le i \le n-1$. Again, the fraction of the
rate of request $\bar{m}'_i$, at the output line of
switch A, consists of both favorite and non-favorite
requests to switch B. The request rate
is equally directed to all the $a$ output lines
of the switch B. Out of this, $\frac{a^{i}}{N}$ is the
fraction of the request rate that is directed to
favorite memory $MM_i$ and $\left(\frac{1}{a} - \frac{a^{i}}{N}\right)$ is the
fraction of request rate directed to other
memory modules but appears as a favorite request
to switch B.


Hence, at the input of the $(i+1)$th stage,


$$ m(f)_i = \frac{m(f)_{i-1}}{m_{i-1}} \cdot m'_i + \frac{a^{i}}{N} \cdot \bar{m}'_i \tag{13} $$

$$ m(nf)_i = \frac{N/a^{i+1} - 1}{N/a^{i} - 1} \cdot \frac{m(nf)_{i-1}}{m_{i-1}} \cdot m'_i + \left(\frac{1}{a} - \frac{a^{i}}{N}\right) \bar{m}'_i \tag{14} $$

and

$$ m_i = m(f)_i + m(nf)_i \quad \text{for } 1 \le i \le n-1. \tag{15} $$


At the input of the first stage, $m(f)_0$ is the
probability that a processor requests its
favorite memory given that it generates a
request. Then,

$$ m(nf)_0 = \frac{N/a - 1}{N - 1} \left(1 - m(f)_0\right), \tag{16} $$

with $m_0 = m(f)_0 + m(nf)_0$.


With given values of $p_0$ and $m(f)_0$, $p_i$'s and
$m_i$'s can be computed recursively for $1 \le i \le n-1$.


The rate of request at the output of the final
stage, $p_n$, decides the $BW_f$ of an NxN delta
network.
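The recursive procedure described above can be sketched as a short routine. This is an illustrative reading of eqns. (11) to (16), not the authors' program: the function name and argument names are assumptions, and the initial value $m_0 = m(f)_0 + m(nf)_0$ is taken by analogy with eqn. (15).

```python
def delta_favorite_bandwidth(n_stages: int, a: int, p0: float, mf0: float) -> float:
    """BW_f of an NxN delta network (N = a**n_stages) built from a x a switches."""
    if n_stages < 1 or a < 2 or not 0.0 < p0 <= 1.0:
        raise ValueError("need n_stages >= 1, a >= 2 and 0 < p0 <= 1")
    N = a ** n_stages
    mf = mf0
    mnf = (N / a - 1) / (N - 1) * (1.0 - mf0)  # eqn. (16)
    m = mf + mnf                                # assumed initial m_0
    p = p0
    for i in range(1, n_stages + 1):
        x = (1.0 - m) / (a - 1)
        p_next = 1.0 - (1.0 - p * m) * (1.0 - p * x) ** (a - 1)  # eqn. (11)
        if i < n_stages:
            if x > 0.0:
                fav_rate = m / (a * x) * (1.0 - (1.0 - p * x) ** a)  # eqn. (12)
            else:
                fav_rate = p * m  # limit of eqn. (12) when every request is favorite
            m_prime = fav_rate / p_next
            m_bar = 1.0 - m_prime
            mf_next = (mf / m) * m_prime + (a ** i / N) * m_bar  # eqn. (13)
            straight = (N / a ** (i + 1) - 1) / (N / a ** i - 1)
            mnf_next = straight * (mnf / m) * m_prime + (1.0 / a - a ** i / N) * m_bar  # eqn. (14)
            mf, mnf = mf_next, mnf_next
            m = mf + mnf  # eqn. (15)
        p = p_next
    return N * p


if __name__ == "__main__":
    for mf0 in (1.0 / 8, 0.5, 0.8, 1.0):
        print(round(mf0, 3), round(delta_favorite_bandwidth(3, 2, 1.0, mf0), 4))
```

Setting $m(f)_0 = 1/N$ makes every $m_i$ equal to $1/a$, so the recursion falls back to eqn. (9) with $a = b$; setting $m(f)_0 = 1$ gives $BW_f = p_0 N$.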


Favorite Analysis of Omega Networks 


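An Omega network [5] is a special case of
delta network for $N = 2^n$. With $a = 2$, the
above set of equations can be simplified as below
for $1 \le i \le n-1$.


$$ m(nf)_0 = \frac{N/2 - 1}{N - 1} \left(1 - m(f)_0\right) \quad \text{and} \quad m_0 = m(f)_0 + m(nf)_0 \tag{18} $$

$$ p_i = 1 - (1 - p_{i-1} m_{i-1})(1 - p_{i-1} + p_{i-1} \cdot m_{i-1}) \tag{19} $$

$$ m'_i = \frac{m_{i-1}}{p_i} \left\{ p_{i-1} - \frac{p_{i-1}^{2}}{2} + \frac{p_{i-1}^{2} \cdot m_{i-1}}{2} \right\} \tag{20} $$

$$ \bar{m}'_i = 1 - m'_i $$

$$ m(f)_i = \frac{m(f)_{i-1}}{m_{i-1}} \cdot m'_i + \frac{2^{i}}{N} \cdot \bar{m}'_i \tag{21} $$

$$ m(nf)_i = \frac{N/2^{i+1} - 1}{N/2^{i} - 1} \cdot \frac{m(nf)_{i-1}}{m_{i-1}} \cdot m'_i + \left(\frac{1}{2} - \frac{2^{i}}{N}\right) \bar{m}'_i \tag{22} $$

$$ m_i = m(f)_i + m(nf)_i \quad \text{for } 1 \le i \le n-1 \tag{23} $$

$$ \text{and} \quad BW_f = p_n N. \tag{24} $$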


The $BW_f$ obtained for an Omega network for
various values of $N$ is plotted in Fig. 8 together
with the $BW_e$ for the equally likely case. It
may be noted that for higher values of $m(f)_0$ the
performance of an Omega network is close to that
of a crossbar. With a straight connection of 2x2
switches in an Omega network, a favorite case
corresponds to an identity permutation. If most
of the time an identity permutation is desired,
fewer conflicts will occur, which will give rise to
an increased Bandwidth. The above set of equations,
derived for the Omega network, also holds
for MINs like the Indirect binary n-cube, Generalized
cube and Baseline networks, as long as there is
one and only one path from a processor to a
memory module. The favorite memories may be
different depending on the permutations obtained
when all the switches are straight connected.
Fig. 9 shows the variation of $p_i$'s and $m_i$'s at
various stages of a 1024x1024 Omega network.
Here $i = 0$ represents the input side and
$i = 10$ represents the output side of the Omega
network. Although $p_i$ reduces with increase in
$i$ because of more conflicts, $m_i$ goes on
increasing. This means that for large $i$, the
rate of request at the output is mainly due to
the favorite requests. The limiting value of $m_i$
is unity.
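The stage-by-stage behaviour described for Fig. 9 can be traced with a small sketch of eqns. (18) to (23). It is an illustration under the equations as read here, not a reproduction of the figure: the function name is hypothetical, $p_0 = 1$ and $m(f)_0 = 0.8$ in the demo are arbitrary choices, and $m_i$ is listed only up to stage $n-1$ because eqn. (23) is stated for that range.

```python
def omega_stage_profile(n_stages: int, p0: float, mf0: float):
    """Return ([p_0..p_n], [m_0..m_{n-1}]) for an Omega network with N = 2**n_stages."""
    if n_stages < 1 or not 0.0 < p0 <= 1.0:
        raise ValueError("need n_stages >= 1 and 0 < p0 <= 1")
    N = 2 ** n_stages
    mf = mf0
    mnf = (N / 2 - 1) / (N - 1) * (1.0 - mf0)  # eqn. (18)
    m = mf + mnf
    p = p0
    p_values, m_values = [p], [m]
    for i in range(1, n_stages + 1):
        p_next = 1.0 - (1.0 - p * m) * (1.0 - p + p * m)  # eqn. (19)
        if i < n_stages:
            m_prime = (m / p_next) * (p - p * p / 2 + p * p * m / 2)  # eqn. (20)
            m_bar = 1.0 - m_prime
            mf_next = (mf / m) * m_prime + (2 ** i / N) * m_bar  # eqn. (21)
            straight = (N / 2 ** (i + 1) - 1) / (N / 2 ** i - 1)
            mnf_next = straight * (mnf / m) * m_prime + (0.5 - 2 ** i / N) * m_bar  # eqn. (22)
            mf, mnf = mf_next, mnf_next
            m = mf + mnf  # eqn. (23)
            m_values.append(m)
        p = p_next
        p_values.append(p)
    return p_values, m_values


if __name__ == "__main__":
    p_values, m_values = omega_stage_profile(10, 1.0, 0.8)
    for i, p_i in enumerate(p_values):
        m_i = m_values[i] if i < len(m_values) else None
        print(i, round(p_i, 4), None if m_i is None else round(m_i, 4))
```

Since eqn. (19) can be rewritten as $p_i = p_{i-1} - p_{i-1}^2 m_{i-1}(1 - m_{i-1})$, the request rate can only fall from one stage to the next, which matches the trend reported for $p_i$.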


CONCLUSIONS 


Crossbar and Delta networks were analyzed in
this paper for equally likely and favorite memory
cases. Equally likely is shown to be a special
case of the favorite memory analysis. With
favorite memories, the Bandwidth is much higher
because of fewer conflicts. The Delta networks
perform close to crossbars for favorite memory
cases, thus increasing the cost effectiveness.
In a multistage interconnection network, the rate
of request at a stage ($p_i$) reduces with
increase in the stages, but the rate of favorite
request goes on increasing, being limited to
unity.


The analysis has been restricted to NxN delta 
networks because of the complexity involved. The 
analytical results match with those obtained in 
simulations. 
Abstract


The throughput of unbuffered delta networks 
is related to the arrival rate by a quadratic re- 
currence relation. Lower and upper bounds on the 
solution of this recurrence relation are derived 
in this paper. 


Two approaches for improving the throughput
of unbuffered delta networks are discussed in this 
paper. The first approach combines multiple delta 
subnetworks of size NXN each in parallel, to ob- 
tain a network of size NXN. Three distribution 
policies, used to distribute the incoming packets 
between the subnetworks, are discussed in this pa- 
per and their effect on the throughput is investi- 
gated. 


The second approach replaces each link of the 
simple delta networks by $K$ parallel links
(K=2,4,...). The throughput of these networks 
is analyzed and one possible implementation for 
the crossbar switches to be used in these networks 
is discussed. The throughput of such networks 
with four parallel links is almost equal to the 
throughput of crossbars. 


1. Introduction 


Delta networks have been considered frequently
for processor-memory and processor-processor
interconnection in modular computer systems such
as SIMD, MIMD and Data Flow Machines [1, 3, 7, 8,
9, 11, 12, 14]. An N×N ($N = 2^n$) delta network
can be constructed from basic switches of size
B×B ($B = 2^b$), each capable of connecting its
inputs to any of its outputs (see Figure 1.1).
The network has $n/b$ stages (numbered
$1, 2, \ldots, n/b$) and each stage has $2^{n-b}$ basic
switches. The outputs of switches in all stages,
except the last, are connected to the inputs of
switches in the next stage by the shuffle connection
or one of its minor variants. The network
shown in Figure 1.1 is an 8×8 delta network
constructed from 2×2 basic switches. The $N$ inputs
of the switches in the first stage and the $N$ outputs
of the switches in the last stage constitute
the inputs and the outputs of the network. A
truncated delta network is obtained by deleting
one or more stages from a regular delta network.

The modules at the network inputs generate
fixed-size packets to be transmitted over the
network. The arrivals of packets at the network
inputs are independent and identical Bernoulli
processes with parameter $x_{in}$ (the arrival rate).
These packets are directed equiprobably to all
network outputs.


All the switches in the network are synchron- 
ized by a single clock. A connection between two 
switches which is capable of carrying one packet 
in each clock cycle is called a link. 


The switches in a buffered delta network have
internal buffers to temporarily store an incoming
packet that cannot be forwarded in the current cycle.
Unbuffered delta networks have no such
internal buffers. In this paper we will investigate
the performance of unbuffered delta networks
only.


The network control is decentralized and each
switch in the network operates autonomously. In
addition to data, each packet carries its destination
address. A switch in stage $i$ ($1 \le i \le n/b$)
uses bits $b_{(i-1)b+1} \cdots b_{i \times b}$ of the destination
address (expressed as a binary number
$b_1 b_2 \cdots b_n$) to route the packets to the appropriate
output port. These bits are called the control
bits for stage $i$. The operation of delta
networks is described in detail in [12].
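The control-bit rule can be illustrated with a few lines of code. The sketch below assumes that $b_1$ is the most significant bit of the $n$-bit destination address, which the text does not state explicitly; the function names are hypothetical.

```python
def control_bits(dest: int, n: int, b: int, stage: int) -> int:
    """Control bits b_{(stage-1)b+1} .. b_{stage*b} of dest, returned as an integer.

    Assumes b_1 is the most significant of the n address bits.
    """
    if n % b:
        raise ValueError("n must be a multiple of b")
    if not 1 <= stage <= n // b:
        raise ValueError("stage must satisfy 1 <= stage <= n/b")
    if not 0 <= dest < (1 << n):
        raise ValueError("dest must fit in n bits")
    shift = n - stage * b  # number of address bits to the right of this stage's field
    return (dest >> shift) & ((1 << b) - 1)


def output_ports(dest: int, n: int, b: int) -> list:
    """Output port selected at each stage on the way to dest."""
    return [control_bits(dest, n, b, s) for s in range(1, n // b + 1)]


if __name__ == "__main__":
    print(output_ports(0b101, 3, 1))   # 8x8 network of 2x2 switches
    print(output_ports(0b1101, 4, 2))  # 16x16 network of 4x4 switches
```

Each switch needs only its own $b$-bit field of the address, which is what allows the switches to operate autonomously.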


Packets arriving at two distinct network in- 
puts may require the use of a common link between 
two stages. Since only one packet can use that 
link in a clock cycle, one of the packets will be
ignored in the current cycle and resubmitted at a 
later time. Because of such conflicts there is a 
degradation in the throughput (number of packets 
transmitted/unit cycle) of the network. 


The performance of unbuffered delta networks
has been investigated by Patel [12] and Dias and
Jump [5, 4]. The throughput of an unbuffered delta
network has been expressed as a quadratic
recurrence relation [12]. Unfortunately, this
recurrence relation fails to show the dependence of
network throughput on the number of stages in the
network, the basic switch size, and the arrival
rate.

Kruskal and Snir provide asymptotic solutions 
for this recurrence relation [10]. In this paper 
we show that one of these solutions is a strict 
upper bound on the performance of delta networks. 
We also derive a strict lower bound on the perfor- 
mance of delta networks, which is much more accu- 
rate than the upper bound for networks constructed 
from 2X2 switches. Both these bounds incorporate 
the dependence of network throughput on the number 
of stages, basic switch size and the arrival rate. 


The performance of unbuffered delta networks
can be improved by either using multiple delta
subnetworks in parallel as shown in Figure 1.2a
[8], or by replacing each link in the delta network
by multiple links as shown in Figure 1.2b
[13].


In the first approach, various policies can 
be used for distributing the incoming packets 
between the subnetworks. Some of these are dis- 
cussed in this paper, where the effect of distri- 
bution policy on the performance of the network is 
investigated. 


The throughput of multiple link delta net- 
works, constructed from 2X2 switches, can be ex- 
pressed as a set of coupled nonlinear recurrence 
relations [10]. These recurrence relations again 
fail to show the dependence of throughput on the 
number of stages in the network, the switch size 
and the arrival rate. In this paper we analyze 
multiple link delta networks constructed from 
larger switches. The throughput of these networks 
is expressed by coupled nonlinear recurrence relations.
An approximate solution with a simple
functional form is also derived for the
throughput.


The above mentioned approaches also improve 
the fault-tolerance of the network. In the first 
approach, only one correctly functioning delta 
subnetwork is required to allow communication 
between any input, output pair of the network. In 
the second approach, only one out of each set of 
links connecting two particular switches, is re- 
quired to function correctly. 


In section two of this paper we will estab- 
lish fairly tight lower and upper bounds on the 
throughput of simple delta networks. These bounds 
have simple functional forms. In section 3 dif- 
ferent techniques for combining multiple delta 
networks in parallel are considered and the im- 
provement in throughput achieved by the use of 
different distribution policies is compared. In 
section 4 the use of multiple links is investigat- 
ed. Switch implementations for supporting multi- 
ple links are described. The throughput of these 
networks is compared with the throughput of 
crossbars. 


In sections 3 and 4, only basic switches of size
2X2 are considered, to keep the presentation
simple. However, the results can be easily
generalized to BXB switches.

2. Performance of Unbuffered Delta Networks


The following expression for the throughput
of a BXB crossbar switch was derived by Patel
[12]. If the arrivals of packets at the inputs of
a switch are independent and identical Bernoulli
processes with the same arrival rate $x_{in}$, then the
arrival of packets at each output (output process)
is a Bernoulli process with the parameter $x_{out}$
(output rate). $x_{out}$ is related to $x_{in}$ as follows

$$x_{out} = 1 - \left(1 - \frac{x_{in}}{B}\right)^{B} \qquad (2.1)$$


The output processes at different outputs of
a switch are identical but not independent. In an
NXN delta network constructed from BXB switches,
the output rate of switches in stage i is denoted
by $x_i$. The arrival rate at the input links of the
network, $x_{in}$, is equal to $x_0$, the arrival rate at
the input of switches in stage 1. The throughput
of the network, $x_{out}$, is equal to $x_{n/b}$. It was
shown by Dias that the arrivals of packets at the B
inputs of the same switch in any stage are identical
and independent Bernoulli processes [6].
Therefore the output rate $x_i$ of stage i
($1 \le i \le n/b$) can be expressed by the quadratic
recurrence relation

$$x_i = 1 - \left(1 - \frac{x_{i-1}}{B}\right)^{B} \qquad (2.2)$$

with $x_0 = x_{in}$ and $x_{out} = x_{n/b}$, where $n = \log_2 N$ and $b = \log_2 B$.
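As a minimal sketch of how recurrence (2.2) is evaluated numerically, the following Python function iterates it once per stage. The function and parameter names are illustrative only and do not appear in the paper; the number of stages is taken to be n/b with n = log2 N and b = log2 B.

```python
def delta_throughput(x_in: float, switch_size: int, stages: int) -> float:
    """Iterate x_i = 1 - (1 - x_{i-1}/B)^B, starting from x_0 = x_in.

    switch_size is B and stages is n/b (hypothetical parameter names).
    """
    x = x_in
    for _ in range(stages):
        x = 1.0 - (1.0 - x / switch_size) ** switch_size
    return x


if __name__ == "__main__":
    # 64X64 network built from 2X2 switches: n = 6, b = 1, so 6 stages.
    print(delta_throughput(1.0, 2, 6))
```

The loop makes the cost of the exact method visible: one evaluation per stage, with no closed form relating the result to $x_0$, N and B.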

Unfortunately, this recurrence relation provides
no insight into either the functional form
of $x_{out}$ and its dependence on $x_0$, N and B or the
upper and lower bounds on $x_{out}$. Such functional
forms or bounds would give a better idea of the
network throughput and allow us to compare the
throughputs of various networks without resorting
to computationally intensive or graphical techniques.
The upper and lower bounds on $x_i$ are
derived as follows.


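Define $y_i = 1/x_i$ for $0 \le i \le n/b$; then

$$y_{i+1} - y_i = \frac{1}{1-(1-x_i/B)^{B}} - \frac{1}{x_i} = \frac{\binom{B}{2}(x_i/B)^2 - \binom{B}{3}(x_i/B)^3 + \cdots}{x_i\left\{\binom{B}{1}(x_i/B) - \binom{B}{2}(x_i/B)^2 + \cdots\right\}} \qquad (2.3)$$

therefore

$$\lim_{x_i \to 0}\,(y_{i+1} - y_i) = \frac{B-1}{2B}$$

Define $e_i = y_{i+1} - y_i - \frac{B-1}{2B}$; then

$$e_i = \frac{2Bx_i - \left(2B + (B-1)x_i\right)\left(1 - (1-x_i/B)^{B}\right)}{2Bx_i\left(1 - (1-x_i/B)^{B}\right)} \qquad (2.4)$$

and

$$y_i = y_0 + \frac{i(B-1)}{2B} + \sum_{j=0}^{i-1} e_j \qquad (2.5)$$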


Let $n_i$ and $d_i$ denote the numerator and the denominator
in the expression for $e_i$ in equation 2.4.

$$d_i = 2Bx_i\left[x_i - \binom{B}{2}\left(\frac{x_i}{B}\right)^2 + \left\{\binom{B}{3}\left(\frac{x_i}{B}\right)^3 - \binom{B}{4}\left(\frac{x_i}{B}\right)^4\right\} + \cdots + \{\text{last term}\}\right] \qquad (2.6)$$

where the last term is $\{B(x_i/B)^{B-1} - (x_i/B)^{B}\}$ iff
B is even and it is $\{(x_i/B)^{B}\}$ otherwise. Since
each term within the braces is a positive quantity
(because $x_i$ is less than 1) we have the inequality

$$d_i \ge 2Bx_i^2 - (B-1)x_i^3 \qquad (2.7)$$

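Similarly, $n_i$ can be written as

$$n_i = 2Bx_i - \left(2B + (B-1)x_i\right)\left[x_i - \binom{B}{2}\left(\frac{x_i}{B}\right)^2 + \binom{B}{3}\left(\frac{x_i}{B}\right)^3 - \binom{B}{4}\left(\frac{x_i}{B}\right)^4 + \left\{\binom{B}{5}\left(\frac{x_i}{B}\right)^5 - \binom{B}{6}\left(\frac{x_i}{B}\right)^6\right\} + \cdots + \{\text{last term}\}\right] \qquad (2.8)$$

where the last term is $\{(x_i/B)^{B}\}$ iff B is odd and
it is $\{B(x_i/B)^{B-1} - (x_i/B)^{B}\}$ otherwise.

Again, each term within the braces is positive and
therefore

$$n_i \le 2Bx_i - \left(2B + (B-1)x_i\right)\left[x_i - \binom{B}{2}\left(\frac{x_i}{B}\right)^2 + \binom{B}{3}\left(\frac{x_i}{B}\right)^3 - \binom{B}{4}\left(\frac{x_i}{B}\right)^4\right]$$

simplifying this expression we have

$$n_i \le x_i^3\left[\frac{B^2-1}{6B} + \frac{2}{B^3}\binom{B}{4}x_i\right]$$

since $0 < x_i < 1$

$$n_i \le x_i^3\left[\frac{B^2-1}{6B} + \frac{2}{B^3}\binom{B}{4}\right] \qquad (2.9)$$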


Using the bounds for $n_i$ and $d_i$ we get the upper
bound for $e_i$

$$e_i \le \frac{\dfrac{B^2-1}{6B} + \dfrac{2}{B^3}\dbinom{B}{4}}{2By_i - (B-1)} \qquad (2.10)$$

If $x_i$ is positive then both $n_i$ and $d_i$ are positive
(follows from equations 2.7 and 2.9), and
therefore $x_{i+1}$ and $e_i$ will be positive too. From
this lower bound on $e_i$ the following lower bound
on $y_i$ follows easily

$$y_i \ge y_0 + \frac{i(B-1)}{2B} \qquad (2.11)$$
In equation 2.10 the occurrence of $y_i$ in the
denominator can be replaced by the lower bound for
$y_i$ and the following inequality is obtained

$$e_i \le \frac{\dfrac{B^2-1}{6B} + \dfrac{2}{B^3}\dbinom{B}{4}}{2By_0 + (B-1)i - (B-1)}$$

Simplifying this inequality we have

$$e_i \le \frac{B^2 - B + 2}{4B^2\left[i + 2y_0B/(B-1) - 1\right]}$$

Therefore

$$\sum_{j=0}^{i-1} e_j \le \frac{B^2 - B + 2}{4B^2}\,\log_e\!\left(\frac{i(B-1)}{2By_0 - 2B + 2} + 1\right) \qquad (2.12)$$


Denote the right hand side of the above inequality
by $E_i$. Thus, the upper bound on $y_i$ is

$$y_i \le y_0 + \frac{i(B-1)}{2B} + E_i \qquad (2.13)$$


The upper and lower bounds for $x_i$ are obtained
by inverting the lower and upper bounds for
$y_i$. Thus, we have the result

$$\frac{2B}{2By_0 + (B-1)i + 2BE_i} \;\le\; x_i \;\le\; \frac{2B}{2By_0 + (B-1)i} \qquad (2.14)$$
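The bounds in (2.14) can be checked against the exact recurrence with a short Python sketch. Two things here are assumptions rather than text from the paper: the function names, and the explicit form used for $E_i$, namely $E_i = \frac{B^2-B+2}{4B^2}\log_e\left(\frac{i(B-1)}{2By_0-2B+2}+1\right)$, which is a reading of the partly illegible inequality (2.12). The upper bound depends only on (2.11) and does not rely on that reading.

```python
import math


def exact_output_rate(x0: float, B: int, i: int) -> float:
    """Exact x_i from recurrence (2.2)."""
    x = x0
    for _ in range(i):
        x = 1.0 - (1.0 - x / B) ** B
    return x


def upper_bound(x0: float, B: int, i: int) -> float:
    """Right-hand side of (2.14): 2B / (2B*y0 + (B-1)*i), with y0 = 1/x0."""
    y0 = 1.0 / x0
    return 2 * B / (2 * B * y0 + (B - 1) * i)


def lower_bound(x0: float, B: int, i: int) -> float:
    """Left-hand side of (2.14), using the assumed form of E_i (see lead-in)."""
    y0 = 1.0 / x0
    c = (B * B - B + 2) / (4.0 * B * B)
    e_sum = c * math.log(i * (B - 1) / (2 * B * y0 - 2 * B + 2) + 1.0)
    return 2 * B / (2 * B * y0 + (B - 1) * i + 2 * B * e_sum)


if __name__ == "__main__":
    for stages in (1, 3, 6, 10):
        lo = lower_bound(1.0, 2, stages)
        ex = exact_output_rate(1.0, 2, stages)
        hi = upper_bound(1.0, 2, stages)
        print(stages, round(lo, 4), round(ex, 4), round(hi, 4))
```

Unlike the recurrence, both bounds are evaluated in constant time for any stage index, which is the practical benefit of a closed functional form.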


In Figures 2.1a-2.1d the throughput of delta
networks, obtained from recurrence relation (2.2),
is compared with the lower and upper bounds given
by equation (2.14). For low arrival rates the
bounds are much more accurate than for high arrival
rates. For networks constructed from 2X2
switches, the lower bounds are within 5% of the
actual throughput and the upper bounds are within
10% of the actual throughput (for x ij}. 


For larger switch sizes (sizes > 4X4) the
lower bounds are less accurate than the upper
bounds. The upper bounds are still within 10% of
the actual throughput.


3. Connecting Delta Networks in Parallel


Three techniques for using $K = 2^k$ delta subnetworks
or truncated delta subnetworks (constructed
from 2X2 switches) of size NXN each in
parallel, to obtain a network of size NXN, are
discussed in this section. These techniques
differ primarily in the distribution policy used 
to distribute the incoming packets between the 
subnetworks. 


The first technique is to connect the i-th input
of the network ($0 \le i \le N-1$) to the i-th input
of each delta subnetwork through a 1-to-K demultiplexer.
Similarly the i-th output of each delta
subnetwork is connected to the i-th output of the
network through a K-to-1 multiplexer (see Figure
1.2a). The demultiplexers forward an incoming
packet to any of the K subnetworks equiprobably.
If multiple requests arrive at the input of a multiplexer
in the last stage, one of them is selected
equiprobably and is forwarded to the output of
the network. This network is called a Randomly
loaded parallel delta network (Rn).


If $x_{in}$ is the arrival rate at the input of
the network, then the arrival rate at the input of
each delta subnetwork, $x_0$, is equal to $x_{in}/K$.
The output rate at the output links of each
delta subnetwork, $x_n$, can be obtained from
recurrence relation (2.2). The output rate at the
output of the network, $x_{out}$, is given by the
expression

$$x_{out} = 1 - (1 - x_n)^{K} \qquad (3.1)$$

If $x_u$ and $x_l$ are the upper and the lower
bounds on $x_n$ then the upper and lower bounds on
$x_{out}$ are

$$1 - (1 - x_l)^{K} \;\le\; x_{out} \;\le\; 1 - (1 - x_u)^{K} \qquad (3.2)$$


In the technique described above, a packet is
blocked within the subnetwork with probability


l-x. If every packet arriving at an input of 
the network is forwarded to more than one delta 
subnetwork simultaneously, then the probability 
that all copies of the same packet are blocked 
within the subnetworks is expected to be much less 
than I-x. The reduction in blocking of packets 


within the subnetworks will in turn increase the 
throughput of the network. The second technique 
for combining multiple delta subnetworks in paral- 
lel, utilizes this fact. In this technique the 
packets arriving at the inputs of the network are 
forwarded to all the subnetworks. This network is 
called a Multiple loaded parallel delta network 
(Mn). Since the throughput of this network could 
not be determined analytically, simulation tech- 
niques were used. 


The last technique is to demultiplex the incoming
packets between K truncated delta subnetworks
according to the destination address of the
packet. Each subnetwork has n-k stages and the
control bit for stage i of the subnetwork is
$b_{k+i}$. The subnetworks are numbered 0 through K-1.
The incoming packet is forwarded to subnetwork j
iff $b_1 b_2 \ldots b_k$, the first k bits in the binary
representation of the destination address, are
also the binary representation of j.


All the K outputs of the j-th subnetwork
whose binary representations differ only in the
first k bits are connected to one K-to-1 multiplexer
(see Figure 3.1). The output of this multiplexer
is the network output with binary
representation $b_1 b_2 \ldots b_k b_{k+1} b_{k+2} \ldots b_n$, where
$b_{k+1} \ldots b_n$ are the common bits in the binary
representation of these subnetwork outputs and
$b_1 b_2 \ldots b_k$ is the binary representation of j. This
network is called a Selectively loaded parallel
delta network (Sn).


If $x_{in}$ is the arrival rate at each input of
the network, then the arrival rate at the inputs
of the subnetworks, $x_0$, is equal to $x_{in}/K$. The
output rate at the output of each truncated delta
subnetwork will be $x_{n-k}$, which can be obtained
from recurrence relation (2.2). The output rate,
$x_{out}$, at each output of the network is given by
the expression

$$x_{out} = 1 - (1 - x_{n-k})^{K} \qquad (3.3)$$

The upper and lower bounds on $x_{out}$ can be obtained
by using the upper and lower bounds of $x_{n-k}$
in the above expression.
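The two analytically tractable policies can be compared with a short Python sketch that evaluates (3.1) and (3.3) on top of recurrence (2.2) for 2X2 switches. The function names are hypothetical, and both functions assume the subnetwork input rate $x_0 = x_{in}/K$ with $K = 2^k$.

```python
def iterate_stages(x0: float, stages: int, B: int = 2) -> float:
    """Apply recurrence (2.2) for the given number of stages."""
    x = x0
    for _ in range(stages):
        x = 1.0 - (1.0 - x / B) ** B
    return x


def rn_throughput(x_in: float, n: int, k: int) -> float:
    """Randomly loaded network (Rn), equation (3.1): K full subnetworks."""
    K = 2 ** k
    x_n = iterate_stages(x_in / K, n)
    return 1.0 - (1.0 - x_n) ** K


def sn_throughput(x_in: float, n: int, k: int) -> float:
    """Selectively loaded network (Sn), equation (3.3): n-k stages each."""
    K = 2 ** k
    x_nk = iterate_stages(x_in / K, n - k)
    return 1.0 - (1.0 - x_nk) ** K


if __name__ == "__main__":
    # 8X8 network (n = 3) with K = 2 subnetworks (k = 1), full load.
    print(rn_throughput(1.0, 3, 1), sn_throughput(1.0, 3, 1))
```

With these formulas Sn comes out ahead of Rn for the same K, because its subnetworks have k fewer stages in which packets can collide.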


In Figures 3.2a and 3.2b the throughputs ob- 
tained by using different distribution policies 
are compared. The performance of Sn is better 
than that of Rn because the subnetworks in Sn have 
fewer stages and therefore fewer packets are lost 
due to collisions in the subnetworks. The perfor- 
mance of Mn was expected to be better than that of 
Rn, because in Mn all possible paths between an 
input and an output are tried simultaneously. 
However, each subnetwork in this situation is more
heavily loaded and the number of collisions in the
subnetworks is greater. The increased number of
collisions almost offsets the advantage of using
multiple paths simultaneously.


The throughputs obtained from a simple delta
network, from the use of multiple links (to be
discussed in the next section), and from an ideal
crossbar [12], are also shown in this figure to
illustrate the relative advantage of using the two
approaches discussed in this paper.


4. Using Multiple Links


In this section we propose the use of delta
networks with multiple links. In an NXN delta
network constructed from 2X2 crossbar switches,
each crossbar switch can receive up to $K = 2^k$
packets at each of its input ports and it can forward
at most K packets to any output port. Figure
1.2b shows an 8X8 delta network implemented from
2X2 switches for K=2. To connect the input port
of a switch to the output port of another, K independent
links are used (since each link carries
only one packet in a clock cycle). The input
ports of switches in the first stage of the network
receive packets only on one of the K links,
which is the input link of the network (the remaining
K-1 links are unused). The packets on the K links
of the output are multiplexed on a single link
which is an output link of the network. A network
with K parallel links is denoted by $D_K$.


Figures 4.1a and 4.1b show one possible implementation
of a switch for $D_2$ and $D_4$. The
operation of the switch for $D_4$ is described next.
The switch implementation can be easily generalized
for arbitrary values of K.


Each switch in $D_4$ contains two banks of four
1-to-2 demultiplexers, labeled the "U" bank and
the "L" bank. The inputs of the "U" bank demultiplexers
receive packets from the upper input port
and those of the "L" bank receive packets from the
lower input port. The outputs of the demultiplexers
are labeled "0" and "1". The demultiplexers
are followed by four 4-input sorters, labeled
uu, ul, lu and ll respectively. The "0" and
"1" outputs of the demultiplexers in the "U" bank
are connected to the uu and ul sorters, and those
of the "L" bank are connected to the lu and ll
sorters. The sorters are followed by two banks of
four 2-to-1 multiplexers, which are again labeled
the "U" bank and the "L" bank. The outputs of uu
and lu sorters are connected to the multiplexers
in the "U" bank, and the outputs of ul and ll
sorters are connected to the multiplexers in the
"L" bank. The outputs of the "U" and "L" multiplexers
form the upper and the lower output of the
switch. The demultiplexers in the switch forward
the incoming packets to their "0" output if they
are directed to the upper output of the switch and
to their "1" output otherwise. The sorters move
the x packets arriving at their inputs ($0 \le x \le 3$)
to their outputs labeled 0 through x. The sorting
network is constructed from Bitonic Sorters [2].


In a switch for $D_4$, each sorter contains six
switching elements shown as boxes with arrows. If
exactly one input of a switching element has a
packet on it, the packet is forwarded to the upper
output port if the arrow in the box points upwards
and to the lower output port if it points down.
If both the input ports of a switching element
have packets on them, they are both passed
straight through.


If the outputs of the two sorters connected
to the same multiplexer bank have a total of K or
fewer (more than K) packets, the multiplexers in
the bank will forward all the (only K of these)
packets to the switch output.


We will derive the expression for the
throughput of a 2X2 crossbar switch with K links
on each input and output port. This result will
be used to derive the throughput of an NXN delta
network constructed from such switches.


If j packets arrive at the K links of an input
port with probability $x_{in}(j)$, which is independent
of the arrival of packets on the second
input port, then the probability of finding m
packets on an output port is given by the expression

$$x_{out}(m) = \sum_{i=0}^{K}\;\sum_{\substack{j=m-i \\ j \ge 0}}^{K} x_{in}(i)\,x_{in}(j)\binom{i+j}{m}2^{-(i+j)} \qquad \text{for } m = 0,1,2,\ldots,K-1$$

$$x_{out}(K) = \sum_{i=0}^{K}\;\sum_{j=K-i}^{K} x_{in}(i)\,x_{in}(j)\sum_{m=K}^{i+j}\binom{i+j}{m}2^{-(i+j)} \qquad (4.1)$$
In an NXN delta network made up of such
switches, let $x_i(m)$ ($0 \le m \le K$) denote the probability
of finding m packets on the output of a
switch in stage i (for $1 \le i \le n$) and let $x_0(m)$
denote the probability of finding m packets at the
input of a switch in stage 1
($\sum_{m=0}^{K} x_i(m) = 1$, for $0 \le i \le n$).
If $x_{in}$ is the arrival rate at every input of
the network then

$$x_0(0) = 1 - x_{in}$$

$$x_0(1) = x_{in}$$

$$x_0(2) = x_0(3) = \cdots = x_0(K) = 0$$


Since packets arrive at the two inputs of the
same switch in any stage independently (the reasons
for this are cited in section 2), the probabilities
of finding m ($m \le K$) packets at the output
of a switch in stage s ($1 \le s \le n$) can be expressed
by the following coupled recurrence relations

$$x_s(m) = \sum_{i=0}^{K}\;\sum_{\substack{j=m-i \\ j \ge 0}}^{K} x_{s-1}(i)\,x_{s-1}(j)\binom{i+j}{m}2^{-(i+j)} \qquad \text{for } m < K$$

$$x_s(K) = \sum_{i=0}^{K}\;\sum_{j=K-i}^{K} x_{s-1}(i)\,x_{s-1}(j)\sum_{m=K}^{i+j}\binom{i+j}{m}2^{-(i+j)} \qquad (4.2)$$


The values of $x_n(0), x_n(1), \ldots, x_n(K)$ can be obtained
by solving these recurrence relations [10].
The output rate at each output of the network,
$x_{out}$, will then be given by

$$x_{out} = 1 - x_n(0)$$

If the modules connected to the output links
of the network are capable of accepting $K'$
($1 \le K' \le K$) packets in each clock cycle, then the
throughput of the network will be given by the
expression

$$x_{out} = \sum_{i=0}^{K} x_n(i)\cdot \mathrm{MIN}\{i, K'\}$$

In this case the K-to-1 multiplexers, at the
outputs of each switch in the last stage, must be
replaced by K-to-$K'$ concentrators.
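The coupled recurrence for a network of 2X2 switches with K links per port can be evaluated numerically along the following lines. This is a sketch under the independence assumption stated above: each of the i+j packets arriving at a switch is taken to choose a given output with probability 1/2, and at most K of them are forwarded. The function names are hypothetical.

```python
from math import comb


def switch_step(x_prev: list, K: int) -> list:
    """One stage of the multiple-link recurrence for a 2X2 switch.

    x_prev[m] is the probability of m packets on an input port (m = 0..K).
    Returns the distribution of packets on one output port.
    """
    x_next = [0.0] * (K + 1)
    for i in range(K + 1):
        for j in range(K + 1):
            total = i + j
            weight = x_prev[i] * x_prev[j] / 2 ** total
            for m in range(total + 1):
                # m of the i+j packets want this output; the port carries
                # at most K of them, so the excess is lumped into entry K.
                x_next[min(m, K)] += weight * comb(total, m)
    return x_next


def multilink_throughput(x_in: float, n: int, K: int) -> float:
    """Output rate 1 - x_n(0) after n stages, with one packet per input."""
    x = [1.0 - x_in, x_in] + [0.0] * (K - 1)
    for _ in range(n):
        x = switch_step(x, K)
    return 1.0 - x[0]


if __name__ == "__main__":
    for K in (1, 2, 4):
        print(K, multilink_throughput(1.0, 6, K))
```

For K = 1 the computation collapses to recurrence (2.2) with B = 2, which gives a convenient sanity check on the implementation.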


As mentioned in section 2, the coupled recurrence
relations (4.2) do not specify the functional
form of $x_{out}$ or its dependence on $x_{in}$ and
n. To find such a functional form we assume that
for each stage i between the k-th and the last
stage, $x_i(j)$ specifies a binomial distribution
with mean $K p_i$, i.e.,

$$x_i(j) = \binom{K}{j}\,p_i^{\,j}\,(1-p_i)^{K-j}$$

The first k stages can be ignored since there can 
be no conflict in these stages. 


Due to the conflicts in the i-th stage $p_{i+1}$ is
less than $p_i$, but $x_{i+1}(j)$ ($0 \le j \le K$) is still
assumed to specify a binomial distribution. It
can be shown that

$$p_k = x_{in}/K$$
and 
2K 
oe sf Be 
Xue (3 2 


The following recurrence relation holds for the
values of $p_i$ ($k \le i \le n$)

$$p_{i+1} = \frac{1}{K}\sum_{j=0}^{2K} \mathrm{MIN}\{j,K\}\binom{2K}{j}\left(\frac{p_i}{2}\right)^{j}\left(1 - \frac{p_i}{2}\right)^{2K-j} \qquad (4.3)$$

Here $p_i/2$ is the probability that a link at the
input of a switch in stage i (one of the K links
at each input of a switch) carries a request which
must be forwarded to a given output port. This
recurrence relation can be simplified to


K+1 
Pp. 


+ high order terms 


By deleting the high order terms of $p_i$ and denoting
the coefficient of $p_i^{K+1}$ by Q in the above
equation, the following recurrence relation is obtained


K+l 
= — * 
Pian ~ PE PA 


The forward difference function of this difference
equation is used as a differential operator to obtain
the following differential equation


d K+1 
= * 
ai Q*p 


The solution of this differential equation at integer
points in the domain will be an approximation
for the solution of the difference equation,
and thus we have the following approximate solution
for equation (4.3)


(4.4) 


The analysis presented above can be generalized
to obtain the throughput, $x_{out}$, of NXN delta
networks constructed from BXB crossbar switches,
where K parallel links are used for each connection
between two switches. The actual throughput
is obtained from the set of coupled nonlinear recurrence
relations listed below. In these equations,
I denotes the total number of packets arriving
at all the inputs of a crossbar switch and
$\langle i_0, i_1, \ldots, i_{B-1} \rangle$ is a partition of I, which denotes
the number of packets arriving at each switch input.
Recurrence relation 
p ,(m) = 
BK B-1 
2 I] 
Lpsdyseeestp n=0 


T=1ptij)+...t1, )7m 


x14) (ae Cay = 


for m < K and s = 1,2,...,n/b 


p ,(K) = 
BK B-1 I 
> I x {6i,? Se Gea 
in yipyecegdl n=0 7° 7 eK ™ 
Tortpocrcrotpy 
eee 
for s = 1,2,...,n/b 


Initial Conditions 
x,(0) = 1-X. 
0 in 


x, (0) = X, 
x, (0) = 


! 
al 
r~ 
© 
w 

I 
Sd 

i 


Throughput 


A Sak ee 


Similarly, by making the binomial approxima- 
tion, the following equations are obtained for the 
throughput 


ae) 
~ 
| 
“°) 
i=) 


Figures 4.2a and 4.2b show the throughput of 
network D,- The actual throughput obtained from 


recurrence relation (4.2) and the throughput esti- 
mates obtained from equations (4.3) and (4.4) are 
shown together. Figures 4.2c and 4.2d show the 
same for D,- The estimates obtained by equations 


(4.3) and (4.4) are within 10% of the actual 
throughput for networks of sizes less_ than 


2 ; 

2 > 9¢.249 Equation (4.4) turns out to be a very 
good approximation for the nonlinear recurrence 
relation (4.3). 


From Figure 3.4 it is clear that the improvement
in throughput obtained by replacing each link
of a simple delta network by K parallel links is
greater than that obtained by using K delta subnetworks
in parallel regardless of the distribution
policy.


The throughput obtained by using four links 
in parallel is very close to the throughput of a 
crossbar. 


5. Conclusion

Fairly tight lower and upper bounds were 
derived for the throughput of unbuffered delta 
networks. These bounds have simple functional 


forms and they illustrate the dependence of net- 
work throughput on the arrival rate, network size 
and basic switch size. 


Two approaches for enhancing the performance 
of delta networks were discussed. For networks 
obtained by combining multiple delta subnetworks 
in parallel, three distribution policies were pro- 
posed for distributing the incoming packets 
between the subnetworks. The effect of the dis- 
tribution policy on the throughput of the network 
was investigated. 


Then, networks obtained by replacing each 
link of a simple delta network by multiple links 
were considered. The throughput of these networks 
was analyzed. One possible implementation for the 
basic switches to be used in these networks was 
described. 
Figure 1.1 : An 8X8 delta network constructed from 2X2 switches.
Figure 1.2a: An NXN network constructed from
Figure 1.2b : An 8X8 delta network with double links ($D_2$ network)
Figure 4.1b : A 2X2 crossbar switch for $D_4$
EXPANDING AND CONTRACTING SW-BANYAN NETWORKS


Doug DeGroot 
IBM Thomas J. Watson Research Center 
P.O. Box 218 
Yorktown Heights, New York 10598 


Abstract 


SW-banyan networks are one of the most promising
classes of multistage interconnection networks.
Their advantages include simplicity of control, par- 
titionability, modularity, and expandability. Most 
SW-banyan interconnection networks that have
been studied have been strongly rectangular nxn 
networks, constantly growing (expansion) net- 
works, or constantly shrinking (concentration) 


networks. A class of SW-banyans called expanding 
and contracting SW-banyans is formally defined. 


These networks are shown to offer certain signif- 
icant performance benefits over other multistage 
SW-Banyans. Because they require more 
hardware, they are more costly than certain other 
rectangular SW-banyans. However, they offer sig- 
nificant performance advantages, and thus may be 
suited for high-performance environments. They 
retain the advantage of n log n cost functions. 
More importantly, they retain the “unique-path" 
property. 


INTRODUCTION 


It is clear that interconnection networks com- 
prise a major architectural component of large, 
highly parallel computers. Although many net- 
works have been studied for their suitability to 
this environment [Siegel, Lawrie], one type receiv- 
ing much current attention is the "unique-path” 
multistage network. Given any two resources con- 
nected to opposite sides of the network, there is 
always one and only one path through the network 
between them (thus the name "unique-path"). The
number of stages in these unique-path multistage
networks is usually O(log n), with n switches
(crosspoints) per stage, yielding a cost function of
O(n log n) (as opposed to O(n^2) for crossbars).
The Omega network [Lawrie2] is perhaps the best
known example of a "unique-path" multistage network.
Unique-path networks are in general much
easier to control than multiple-path networks. 

Unfortunately, most of these unique-path net- 
works also contain only partial interconnectivity 
capabilities. Thus, given any two resources to be 
interconnected, if other interconnections presently 
exist, the two resources may not be able to be 
interconnected due to blockage by the other 
already active interconnections. This absence of 
full interconnectivity can cause serious degradation
in system performance if interconnection blockages 
cannot be held to a minimum by either the operat- 
ing system or the user application. Clearly, the 
less blockage inherent in the network the better. 
One large class of blocking, unique-path networks 
is the class of SW-banyan networks. It is shown 
below that expanding and contracting multistage 
SW-banyans have less blockage inherent in them
than other types of multistage SW-banyans. But 
what is just as important, they are also shown to 
possess the unique-path property. 


BANYAN INTERCONNECTION NETWORKS 


In this section a very large, general class of 
multistage interconnection networks called banyans 
is described [Lipovski]. SW-banyans, a proper 
subset of banyans, are a particularly attractive, 
cost-effective, modular type of banyan that can be 
recursively synthesized from smaller SW-banyans. 
A number of single-valued functions are defined 
which describe certain structural and topological 
features of SW-banyans. 


Banyans 


Banyan networks, named for an East Indian and 
Hawaiian fig tree, are defined in terms of their 
graph representation. A banyan (or banyan 
graph) is a Hasse diagram of a partial ordering in 
which there is one and only one path from every 
base to every apex. A base is defined as any ver- 
tex with no edges incident into it; an apex is any 
vertex with no edges incident out of it; all other 
vertices are called intermediate vertices. Some 
examples of banyans are shown in Figure 1. In 
these diagrams, bases are at the bottom and apexes 
are at the top. In the following illustrations, the 
banyans will be drawn as undirected graphs. 


An L-level banyan is a banyan in which every 
base-apex path is of length L. In an L-level 
banyan, there are L levels (stages) of arcs but L+1
levels of nodes. By convention, all following 
banyan illustrations will be numbered baseward, 
with apexes at level 0 and bases at level L. In an 
L-level banyan, arcs exist only between vertices in 
adjacent levels. 


In a banyan, the spread of a vertex is the num- 
ber of edges incident out of the vertex; the fanout 
of a vertex is the number of edges incident into the 
vertex. If every node level of a banyan is such 
that all vertices within the same level have identical 
spread and fanout values, the banyan is called uniform;
otherwise it is called non-uniform. In a uniform
form banyan, the fanout values of the vertices may 
be characterized by an L-component vector F, 
called the fanout vector. Similarly, the spread val- 
ues are characterized by an L-component spread 


vector S. A rectangular banyan is a banyan for 
which F = S. It is shown below that a rectangular 


banyan has the same number of vertices at each 
level. Non-rectangular banyans have F ≠ S.


If every component of the fanout vector F is
equal to some constant f and every component of
the spread vector is equal to some constant s, the
corresponding banyan is called regular; otherwise 
it is called irregular. When f = s, by necessity F = 
S, and the corresponding banyan is both regular 


and rectangular; such banyans are called strongly 


rectangular. An irregular rectangular banyan is 
called weakly rectangular. Figure 1 illustrates 


these concepts. From the above definitions it can 
be seen that a crossbar is a one-level regular 
banyan. 
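The terminology above can be summarized as a small decision procedure on the fanout and spread vectors. The following Python sketch is only an illustration of the definitions; the function name and the returned labels are not taken from the paper.

```python
def classify_banyan(fanout, spread):
    """Classify a uniform banyan from its fanout vector F and spread vector S.

    regular:      every component of F is one constant and every
                  component of S is one constant
    rectangular:  F == S
    """
    fanout, spread = list(fanout), list(spread)
    regular = len(set(fanout)) == 1 and len(set(spread)) == 1
    rectangular = fanout == spread
    if rectangular:
        shape = "strongly rectangular" if regular else "weakly rectangular"
    else:
        shape = "non-rectangular"
    return {"regular": regular, "rectangular": rectangular, "shape": shape}


if __name__ == "__main__":
    print(classify_banyan((2, 2, 2), (2, 2, 2)))
    print(classify_banyan((2, 4), (2, 4)))
    print(classify_banyan((2, 3), (2, 2)))
```

A one-level crossbar has single-component vectors, so it is always classified as regular, in line with the remark above.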


SW-banyans 


SW-banyans are a particularly interesting prop- 
er subset of L-level banyans. They are especially 
suitable for partitioning and connection networks 
[Goke, DeGroot]. SW-banyans can be axiomatically 
defined [Premkumar]. Recall that a banyan is a 
Hasse diagram of a partial ordering in which there 
is one and only one path from every base to every 


apex. The level x reachability set for any base b,
0 < x < L, is defined as the set of all nodes in level
x that can be reached by directed paths from
base b. Similarly, the level x reachability set for 
any apex a, 0 < x < L, is defined as the set of all 
nodes in level x that can be reached by paths from 
apex a. A banyan is an SW-banyan if and only if 
for any two bases b and c, or any two apexes d 
and e, their level x reachability sets are either dis- 
joint or identical, for 0 < x < L. 


Constructing SW-Banyans 


In addition to the preceding axiomatic
definition, a novel constructive definition of
SW-banyans will now be given. This definition
applies only to uniform L-level SW-banyans, but it
can easily be extended to include non-uniform,
non-L-level SW-banyans. It differs considerably
from previous constructive definitions for
SW-banyans and SW-banyan networks [Goke,
Goke2, DeGroot]. It can be used to generate regular
and irregular SW-banyans, and
non-rectangular, strongly rectangular, and weakly
rectangular SW-banyans, as well as the expanding
and contracting SW-banyans presented here. A
number of single-valued functions can be associated
with this definition, and they are given below.
These functions define and describe certain structural
and topological features of SW-banyans.


A uniform one-level SW-banyan is simply a
crossbar. A two-level uniform SW-banyan can be
constructed as follows. Consider any mxn
crossbar, and select t copies of it. Choose any integer
k > 0, and construct another SW-banyan as
follows. Take the first (leftmost) apex of each of
the t crossbars and connect them to k new nodes.
Take the second apex of each crossbar and connect
each of these apexes to k other new nodes. Continue
until all apexes have been connected to
groups of k new nodes. Figure 2 illustrates this
procedure. The resulting 2-level network is
(km)x(tn). Clearly each of the km new nodes has
one and only one path to each of the t crossbars (see
Figure 2). Further, each crossbar apex has one
and only one path to each crossbar base. Therefore,
each of the km new nodes has one and only one
path to each of the tn bases, and thus the constructed
network is a banyan. It is easy to see
that the constructed two-level banyan satisfies the 
axiomatic definition given above, and thus the 
banyan is also an SW-banyan. Now consider any 
mxn (L-1)-level uniform SW-banyan. Choose some t 
of these SW-banyans and some k > 0 as before. An 
L-level SW-banyan can be constructed using the 
same procedure as above. Because each (L-1)-level 
SW-banyan has one and only one path between each 
apex and base, so will the constructed L-level
SW-banyan. This procedure can be recursively
applied to construct an SW-banyan with any num- 
ber of levels. There will always be one and only 
one path between every base and every apex in the 
constructed SW-banyan, and it can be shown that 
the axiomatic definition will always hold. Figure 3 
illustrates this process. 
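The recursive construction can be sketched in Python by storing one 0/1 connection matrix per level of arcs and counting apex-to-base paths as a matrix product. The representation (a list of matrices, rows toward the apexes and columns toward the bases) and all function names are assumptions made for this illustration, not the paper's notation.

```python
def crossbar(m, n):
    """One-level SW-banyan: every one of m apexes joined to every one of n bases."""
    return [[[1] * n for _ in range(m)]]


def block_diag(mat, t):
    """Place t disjoint copies of a connection matrix side by side."""
    rows, cols = len(mat), len(mat[0])
    out = [[0] * (t * cols) for _ in range(t * rows)]
    for u in range(t):
        for i in range(rows):
            for j in range(cols):
                out[u * rows + i][u * cols + j] = mat[i][j]
    return out


def extend(banyan, t, k):
    """Build an L-level SW-banyan from t copies of an (L-1)-level one.

    The a-th apex of every copy is connected to the same group of k new nodes.
    """
    m = len(banyan[0])  # number of apexes in each copy
    lower = [block_diag(layer, t) for layer in banyan]
    top = [[0] * (t * m) for _ in range(k * m)]
    for a in range(m):
        for c in range(k):
            for u in range(t):
                top[a * k + c][u * m + a] = 1
    return [top] + lower


def path_counts(banyan):
    """Number of distinct paths from each apex (row) to each base (column)."""
    result = banyan[0]
    for layer in banyan[1:]:
        result = [[sum(r[x] * layer[x][j] for x in range(len(layer)))
                   for j in range(len(layer[0]))] for r in result]
    return result


if __name__ == "__main__":
    net = extend(crossbar(2, 2), t=2, k=2)  # a (km)x(tn) = 4x4 network
    print(path_counts(net))
```

Every entry of the path-count matrix being 1 is exactly the banyan property: one and only one path between each apex and each base.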


[Figures 2 and 3; recoverable labels: k=2, mxn crossbar, L levels, L-1 levels]
SW-banyan Topological Features


In a uniform SW-banyan, all vertices within a 
given level have identical fanout values and identi- 
cal spread values. The fanouts and spreads of a 
uniform SW-banyan are represented by the 
L-component vectors F and S. For example, the 
3-level SW-banyan in Figure 1.c has F = (2,3) and 
S = (2,2). The fanout of every node at level i is 
denoted f(i) for 0 < i < L-1; the spreads of these 
nodes are denoted s(i) for 1< i< L. In other 
words, f(i) is the i+1'st component of F, and s(i) is 
the i'th component of S. For convenience, we 
define both s(0) and f(L) to be 1. The number of 
nodes in any level x of an L-level SW-banyan is 
defined as n(x). The number of apexes is n(0); 
the number of bases is n(L).


In this section, a number of topological features 
of SW-banyans are described. These features are 
used in the following sections to describe expand- 
ing and contracting SW-banyans. Because the 
proofs of the theorems are so easy and obvious, 
and because they have appeared elsewhere in the 
literature ([Goke, DeGroot]), they are omitted 
here. 

Theorem 1: | 

For any level x in an SW-banyan, 0 < x < L-l, 
n(xt1) = n(x)f(x)/s(xt1). 

Theorem 2: 

For any uniform L-level SW-banyan, n(x) < n(x+1) 
if and only if f(x) > s(xtl). Furthermore, 
n(x) > n(x+1) if and only if f(x) < s(xt1). And 
n(x) = n(x+1) if and only if f(x) = s(x*1). 
Theorem 3: 

In a rectangular SW-banyan, n(x) is constant and 
equal to B, the number of bases in the banyan, for 
O<x<L. 

Corollary 3.1: 


If n(x) = n(xtl) for all x, O< x < L-1, then 
F=S. 

Theorem 4: 

Each base of an  SW-banyan S_ reaches 


s(x+1)s(x+2)...s(L) nodes at level x, 

O< x < L-l. 

Corollary 4.1: 

The number of apexes in a uniform SW-banyan is 
s(O)s(1)...s(€L). 


Corollary 4.2: 


Each node at level x, O<x< UL, reaches 
.§(0)s(1)s(2)...s(x) apexes. 

Theorem 5:

Every apex in a uniform SW-banyan reaches
f(0)f(1)...f(x-1) nodes at level x, 1 ≤ x ≤ L.

Corollary 5.1:

The number of bases in a uniform SW-banyan is
f(0)f(1)...f(L).

Corollary 5.2:

Each node at level x, 0 ≤ x ≤ L, reaches
f(x)f(x+1)...f(L) bases.

Theorem 6:

In a uniform SW-banyan, the number of apexes
equals the number of bases if and only if
f(0)f(1)...f(L-1) = s(1)s(2)...s(L).
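The counting results above can be checked mechanically. The following Python sketch is illustrative only: the function names are hypothetical, and F and S are assumed to be given as the tuples (f(0),...,f(L-1)) and (s(1),...,s(L)).

```python
from math import prod

def num_apexes(S):
    """Corollary 4.1: the apex count is the product of the spreads."""
    return prod(S)

def num_bases(F):
    """Corollary 5.1: the base count is the product of the fanouts."""
    return prod(F)

def level_sizes(F, S):
    """Return [n(0), ..., n(L)] using Theorem 1: n(x+1) = n(x)f(x)/s(x+1)."""
    if len(F) != len(S):
        raise ValueError("F and S must both have L components")
    sizes = [num_apexes(S)]
    for f, s in zip(F, S):
        # F[x] holds f(x) and S[x] holds s(x+1), so zip pairs the right terms.
        nxt, rem = divmod(sizes[-1] * f, s)
        if rem:
            raise ValueError("F and S do not give integral level sizes")
        sizes.append(nxt)
    return sizes

# The banyan of Figure 1.c: F = (2,3), S = (2,2).
print(level_sizes((2, 3), (2, 2)))   # [4, 4, 6]
```

The last entry of the list agrees with Corollary 5.1, since the product of the fanouts (2)(3) is 6.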
EXPANDING AND CONTRACTING SW-BANYANS


Most of the multistage interconnection networks 
that have been studied have been nxn multistage 
networks, that is, they have n inputs (apexes) and 
n outputs (bases). Furthermore, they have almost 
always been strongly rectangular nxn networks - 
they have had n switches in every stage of the 
network. In this section, a certain type of 
non-rectangular nxn network is presented. These 
networks are called "expanding and contracting" 
nxn SW-banyan networks. They are shown to have 
certain performance advantages over the prevalent 
rectangular nxn networks. 


From Theorem 6 above, any nxn SW-banyan
must have f(0)f(1)...f(L-1) = s(1)s(2)...s(L) = n.
Furthermore, to be rectangular (that is, to have n
switches in every stage), it must be that
f(i) = s(i+1) for 0 ≤ i ≤ L-1. It should be clear
that we can take the f(i) and permute them and
still have their product equal to n. For instance,
if f(0)f(1)f(2) = n, then clearly f(1)f(0)f(2) also
equals n. Therefore, given an nxn L-level
SW-banyan with particular fanout and spread vectors
F and S, another nxn SW-banyan can be derived
by simply permuting the components of F or S (or
both). However, unless f(i) still equals s(i+1) for
0 ≤ i ≤ L-1, the nxn SW-banyan will no longer be
rectangular (since F will no longer equal S).


For an example, consider the 8x8 weakly rectangular
SW-banyan in Figure 4a. It has F and S
both equal to (2,4). Because F = S, n(x) is constant
and equal to 8 for 0 ≤ x ≤ 2 (see Theorems 1
and 3). Now consider what happens when F is
changed to the vector (4,2). Clearly now F ≠ S,
and so the resulting SW-banyan cannot be rectangular.
However, we still have that
f(0)f(1) = s(1)s(2) = 8. So the banyan is still 8x8
(see Corollaries 4.1 and 5.1). But because
f(0) > s(1), n(0) < n(1) (Theorem 2). In fact, n(0) = 8, but
n(1) = 16. (From Theorem 1, n(1) = n(0)f(0)/s(1).
And since f(0)/s(1) = 4/2 = 2, n(1) = 2n(0).) In
other words, there are twice as many nodes at level
1 as there are at level 0. The corresponding 8x8
SW-banyan is shown in Figure 4b. Since
f(1)/s(2) = 2/4 = 1/2, there are half as many
nodes at level 2 as there are at level 1. So level 0
has 8 nodes (apexes), level 1 has 16 nodes, and
level 2 has 8 nodes (bases). This banyan expands
outward from the top (apexes) and then contracts
back toward the bottom (bases), somewhat like a
diamond.


Figure 4
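The effect of permuting F while holding S fixed can be explored numerically. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, and orderings that would give a fractional node count at some level are simply skipped.

```python
from itertools import permutations
from math import prod

def level_sizes(F, S):
    """n(0..L) from Theorem 1, or None if some level size is not integral."""
    sizes = [prod(S)]                 # n(0) = s(1)s(2)...s(L)
    for f, s in zip(F, S):
        nxt, rem = divmod(sizes[-1] * f, s)
        if rem:
            return None
        sizes.append(nxt)
    return sizes

def permuted_profiles(F, S):
    """Map each distinct permutation of F to its level sizes, with S fixed."""
    profiles = {}
    for G in sorted(set(permutations(F))):
        sizes = level_sizes(G, S)
        if sizes is not None:
            profiles[G] = sizes
    return profiles

def is_rectangular(sizes):
    """A banyan is rectangular when every level has the same node count."""
    return len(set(sizes)) == 1

for G, sizes in permuted_profiles((2, 4), (2, 4)).items():
    kind = "rectangular" if is_rectangular(sizes) else "non-rectangular"
    print(G, sizes, kind)
```

For S = (2,4), the ordering F = (2,4) gives level sizes 8, 8, 8, while F = (4,2) gives 8, 16, 8, which is the diamond-shaped profile of the example.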


It is easy to see that it is also possible to construct
expanding and contracting mxn SW-banyans
in which m ≠ n.


Preserving the Unique-path Property 


It has seemed perplexing to many that a multistage
network can have twice as many nodes in an
inner level as in the apex or base levels and yet
still retain the unique-path property, that is, that
there can still be one and only one path between
every apex and every base. Figure 4b provides
visual proof of one example of this possibility. To 
see how the unique-path property is maintained, 
recall the constructive definition of L-level 
SW-banyans given in Section 2.3. A uniform 
two-level SW-banyan can be constructed by select- 
ing t mxn crossbars and interconnecting them 
through km new nodes, for some k > 0. It is clear
that doing so yields an SW-banyan with F = (t,n)
and S = (k,m). If t > k, then there will be more
level 1 nodes than there are level 0 nodes. If
n < m, there will be fewer level 2 nodes than there
are level 1 nodes. However, the recursive constructive
definition assures us that the constructed
network will in fact be an SW-banyan and will 
therefore possess the unique-path property. In 
this way, an expanding and contracting network 
can be constructed, and the unique-path property 
is maintained, as explained above. The above pro- 
cedure can be recursively repeated to produce 
expanding and contracting SW-banyans of even 
greater numbers of levels, with the unique-path 
property being easily proven by induction as 
before. 
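The two-level construction can be made concrete with a small graph model. The sketch below is not the wiring of Figure 4b: the rule that apex a attaches to input (a mod m) of every crossbar is an assumption chosen for illustration, and all names are hypothetical. It builds the network and counts the paths between every apex and every base.

```python
from itertools import product

def two_level_banyan(t, m, n, k):
    """Build a two-level SW-banyan from t m-by-n crossbars and k*m apexes.

    Wiring assumption (hypothetical): apex a is attached to input (a mod m)
    of every crossbar, so each crossbar input has spread k.
    """
    apexes = [("apex", a) for a in range(k * m)]
    mids = [("mid", c, i) for c in range(t) for i in range(m)]
    bases = [("base", c, o) for c in range(t) for o in range(n)]
    edges = set()
    for (_, a), c in product(apexes, range(t)):
        edges.add((("apex", a), ("mid", c, a % m)))
    for c, i, o in product(range(t), range(m), range(n)):
        # Each crossbar is a complete bipartite graph on its inputs and outputs.
        edges.add((("mid", c, i), ("base", c, o)))
    return apexes, mids, bases, edges

def count_paths(apexes, mids, bases, edges):
    """Count apex-to-base paths through the single intermediate level."""
    counts = {(a, b): 0 for a in apexes for b in bases}
    for a in apexes:
        for mid in mids:
            if (a, mid) in edges:
                for b in bases:
                    if (mid, b) in edges:
                        counts[(a, b)] += 1
    return counts

# F = (t,n) = (4,2) and S = (k,m) = (2,4): an 8x8 network with 16 inner nodes.
apexes, mids, bases, edges = two_level_banyan(t=4, m=4, n=2, k=2)
print(len(apexes), len(mids), len(bases))
print(set(count_paths(apexes, mids, bases, edges).values()))
```

Under this wiring every apex touches each crossbar exactly once and every base belongs to exactly one crossbar, which is why each apex-base pair is joined by a single path.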


A mathematical proof of the unique-path proper- 
ty of SW-banyans has been given in [Bhuyan]. 


Blocking Characteristics 


Most unique-path multistage interconnection 
networks suffer from various types of blockage. In 
this section, a special type of blockage is consid- 
ered. It is initially assumed that some sort of ded- 
icated, non-interfering connections are to be used 
to interconnect apexes to bases, as in circuit 
switched connections, for example. This assump- 
tion is later relaxed. 


When one apex is connected to a base by means 
of a dedicated communication path, no other new 
apex-base connection can be made if that new con- 
nection requires a node or link in use by the first 
connection. The second connection is said to be 
"blocked." This is a direct consequence of the 
unique-path property. Only when the first con- 
nection is undone can the second be made. Certain 
networks are inherently more prone to blocking 
than others. 


In this section, a function is defined which
gives an indication of the amount of static blockage
inherent in an L-level SW-banyan. This function
allows different SW-banyans to be compared to each
other. The function simply relates how many possible
interconnections are rendered impossible (become
blocked) by any single connection made on an
empty network. Exactly how much run-time
blockage this connection causes depends on many
factors.


Consider any single interconnection. It uses one
and only one node at each level of the SW-banyan.
The base and apex being interconnected are obviously
rendered unavailable; but because this would
be true even in crossbars, their unavailability is
not considered as blockage here. Consider the
node in use at level x, however, 1 ≤ x ≤ L-1. This
node can be reached by s(1)s(2)...s(x) apexes.
Furthermore, this node can reach
f(x)f(x+1)...f(L-1) bases. But one of these apexes
and one of these bases are the apex and the base
in the given interconnection, so they are not considered
as being able to be blocked (they are
already in use). For any x, 0 ≤ x ≤ L, define
a(x), the number of apexes reachable by a node at
level x, to be s(0)s(1)...s(x) (see Corollary 4.2).
Define b(x), the number of bases reachable by a
node at level x, to be f(x)f(x+1)...f(L) (see Corollary
5.2). Then bp(x), the number of blocked
paths that pass through the busy node at level x,
is simply (a(x)-1)(b(x)-1), for 1 ≤ x ≤ L-1. We
define both bp(0) and bp(L) to be zero. The sum
of all bp(x), for 1 ≤ x ≤ L-1, however, does not
yield the total number of blocked paths generated
by a single active path, since many blocked paths
would get counted more than once with this sum.
To avoid the multiple counting, we define the function
bl(x) as [a(x)-a(x-1)](b(x)-1), for
1 ≤ x ≤ L-1. This equation counts the number of
additional blocked paths that are encountered as a
communication path is followed from an apex down
to a base. The total blockage created by a single
connection is then correctly given by the sum of all
bl(x), for 1 ≤ x ≤ L-1.
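The definitions of a(x), b(x) and bl(x) translate directly into a few lines of code. The following is a minimal sketch with hypothetical function names, assuming F = (f(0),...,f(L-1)) and S = (s(1),...,s(L)) are passed as tuples.

```python
from math import prod

def reach_apexes(S, x):
    """a(x) = s(0)s(1)...s(x), with s(0) = 1 (Corollary 4.2)."""
    return prod(S[:x])

def reach_bases(F, x):
    """b(x) = f(x)f(x+1)...f(L), with f(L) = 1 (Corollary 5.2)."""
    return prod(F[x:])

def additional_blocked(F, S, x):
    """bl(x) = [a(x) - a(x-1)](b(x) - 1)."""
    return (reach_apexes(S, x) - reach_apexes(S, x - 1)) * (reach_bases(F, x) - 1)

def total_blockage(F, S):
    """Sum of bl(x) over the inner levels 1 <= x <= L-1."""
    return sum(additional_blocked(F, S, x) for x in range(1, len(F)))

print(total_blockage((4, 2, 2), (2, 2, 4)))   # bl(1) + bl(2) = 3 + 2 = 5
```

For F = (4,2,2) and S = (2,2,4), a(x) takes the values 1, 2, 4, 16 and b(x) the values 16, 4, 2, 1, so bl(1) = (2-1)(4-1) = 3 and bl(2) = (4-2)(2-1) = 2.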


For an example, consider several possibilities of
a 16x16 SW-banyan. The most popular such networks
are strongly rectangular ones in which
either f=s=2 or f=s=4. An expanding and contracting
SW-banyan in which F = (4,2,2) and
S = (2,2,4) is also considered. These networks are
illustrated in Figure 5. The values for a(x), b(x),
and bl(x) are shown for all three. Summing the
bl(x)'s, it can be seen that the f=s=2 SW-banyan
incurs a total blockage of 17 blocked interconnections
for each connection made. The f=s=4
SW-banyan incurs a total blockage of only 9, or
roughly half as many. But the expanding and contracting
SW-banyan incurs a total of only 5. From these
figures, it would seem that the expanding and contracting
SW-banyan is the best performer of the
three, followed by the f=s=4 and then the f=s=2.
These results are consistent with recent studies
[DeGroot, Malek, McMillen, Bhuyan].


For another example, consider a 64x64
SW-banyan. Using 4x2 and 2x4 nodes, a 64x64
SW-banyan can be built with F = (4,4,2,2) and
S = (2,2,4,4), as shown in Figure 6. Level 0, the
apex level, has 64 nodes, level 1 has 128, level 2
has 256, level 3 has 128, and level 4, the base level,
has 64. For each connection made, this 64x64
SW-banyan suffers only 33 blocked
interconnections. The standard f=s=4 SW-banyan
suffers 81, and the f=s=2 SW-banyan suffers 129.
Clearly the expanding and contracting SW-banyan
offers significant performance gains over other
SW-banyans.


It should be noticed that the increased performance
of an expanding and contracting SW-banyan
does not come without cost. First, the number of
stages in an expanding and contracting SW-banyan
may be more than in other networks, leading to
increased communication delays. In addition,
expanding and contracting SW-banyans may easily
require more network switches and wires.
However, importantly, the total cost of expanding
and contracting SW-banyans still grows at the rate
of only n log n.
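The extra hardware can be estimated by counting nodes and links. The sketch below is a rough proxy only: treating each node as one switch and each fanout link as one wire is an assumption made here for illustration, and the function name is hypothetical.

```python
from math import prod

def node_and_link_counts(F, S):
    """Return (total nodes, total links) for a uniform SW-banyan.

    Assumes F and S give integral level sizes.
    """
    sizes = [prod(S)]                     # n(0) = s(1)s(2)...s(L)
    for f, s in zip(F, S):
        sizes.append(sizes[-1] * f // s)  # Theorem 1
    # Every node at level x has f(x) links down to level x+1.
    links = sum(n * f for n, f in zip(sizes, F))
    return sum(sizes), links

for name, F, S in [
    ("expanding/contracting", (4, 4, 2, 2), (2, 2, 4, 4)),
    ("f=s=4", (4, 4, 4), (4, 4, 4)),
    ("f=s=2", (2,) * 6, (2,) * 6),
]:
    print(name, node_and_link_counts(F, S))
```

For the 64x64 networks this gives 640 nodes and 1536 links for the expanding and contracting SW-banyan, against 256 nodes and 768 links for f=s=4 and 448 nodes and 768 links for f=s=2.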


Bandwidth Characteristics 


The above analysis considered all interconnections
to be dedicated, non-interfering connections,
as in circuit switching. This section
considers the analysis of expanding and contracting
SW-banyans in a packet switching environment.
Bandwidth is defined to be the expected number of
memory requests accepted per cycle. To calculate
the bandwidth, it is necessary to calculate and sum
the probabilities of an output occurring at each
base (assuming apexes are the inputs). Bhuyan
and Agrawal provide a simple recursive function
for doing so [Bhuyan]. The probability of output
of a node at network level i is simply


    p(i) = 1 - (1 - p(i-1)/f(i-1))^{s(i)}

We assume here that p(0) = 1 (see [Bhuyan] for
other relevant assumptions). The bandwidth of an
SW-banyan is then simply n(L)p(L). For the 16x16
SW-banyan, the f=s=2 version has a bandwidth
of 7.2, the f=s=4 version has a bandwidth of 8.44,
but the expanding and contracting SW-banyan with
F = (4,2,2) and S = (2,2,4) has a bandwidth of
9.28. Clearly the expanding and contracting
SW-banyan is the better performer in packet
switching environments. Similar results are
obtained for the 64x64 example. It should be easy
to prove that this will always be the case for
expanding and contracting SW-banyans.
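The bandwidth figures can be reproduced with a short recursion. The sketch below takes the recursion to be p(i) = 1 - (1 - p(i-1)/f(i-1))^s(i), which is an assumption about the printed formula; it is the reading that reproduces all three bandwidths quoted above. The function name is hypothetical.

```python
def bandwidth(F, S, p0=1.0):
    """Expected requests accepted per cycle, n(L) * p(L)."""
    p = p0
    for f, s in zip(F, S):
        # f is f(i-1) and s is s(i): each of the s links entering a
        # level-i node carries a request with probability p(i-1)/f(i-1).
        p = 1.0 - (1.0 - p / f) ** s
    bases = 1
    for f in F:
        bases *= f          # n(L) is the product of the fanouts
    return bases * p

for name, F, S in [
    ("f=s=2", (2, 2, 2, 2), (2, 2, 2, 2)),
    ("f=s=4", (4, 4), (4, 4)),
    ("expanding/contracting", (4, 2, 2), (2, 2, 4)),
]:
    print(name, round(bandwidth(F, S), 2))
```

For the rectangular networks f(i-1) = s(i), so the recursion reduces to the usual form for square crossbar stages.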


OTHER TOPOLOGICAL POSSIBILITIES 


It should be clear from Section 3.1 that, with the
proper choice of t and k at each recursive step of
the construction of an SW-banyan, arbitrary
topologies can be achieved. Figure 7 illustrates
some of the many possibilities. Each has the
unique-path property. What the advantages of any
of these topologies are, if any, remains to be
investigated.


Figure 7


CONCLUSIONS 


It has been shown that nxn multistage
SW-banyans can be constructed with more than n
nodes in an internal level. This was not generally
believed to be possible. Because they are
SW-banyans, these networks have the unique-path
property, that is, there is one and only one path
between every apex and every base. Traditionally,
nxn SW-banyans are constructed as
strongly-rectangular SW-banyans. It has been
shown here that the expanding and contracting
SW-banyans possess significantly less inherent
blockage than the corresponding rectangular
SW-banyans. Such networks can be built from only
two or three different types of switches. Although
they require more switches and may result in more
network stages than some strongly rectangular
SW-banyans, they retain the advantages of a cost
function of only O(n log n). As a consequence, it
seems such networks may prove to be more suitable
for interconnecting large numbers of system
resources than the prevalent rectangular networks,
especially when high performance is a major concern.


Abstract


The communication costs for parallel versions of two simple 
algorithms used in image processing are compared in packet 
switching and circuit switching formulations. The two algorithms 
are smoothing and histogramming. The histogramming algorithm, 
the recursive doubling algorithm of Stone, is studied over a range 
of processor numbers and pixel intensity resolution. The packet 
and circuit switching properties of the interconnection networks of 
the multiprocessor systems are based on two network architectured 
multiprocessors which are well-documented in the literature, PASM 
and TRAC. Communication based upon circuit switching generally 
gives a somewhat lower communication cost with the advantage 
increasing with pixel intensity resolution. The results of the 
analysis suggest a high utility value for including both circuit 
switching and packet switching functionality in the networks of 
network architectured multiprocessor systems. 


Introduction and Overview 


This paper compares the communication costs for executing two 
algorithms used in image processing on three parallel computer 
architectures. The purpose of the comparison is to evaluate packet 
switching and circuit switching modes of data movement for 
interconnection network based multiprocessors. The two 
algorithms used for the comparison are computation of histograms 
of the intensity values of pixels of an image and smoothing of gray 
level data across the pixels of an image. 


The model for a packet switching architecture is the Partitionable 
SIMD/MIMD (PASM) System for Image Processing and Pattern 
Recognition [Siegel81]. The model for a circuit switching 
architecture is the Texas Reconfigurable Array Computer (TRAC) 
[Sejnowski80]. The third architecture, all processors sharing a
common bus [Bhuyan82], is given as a baseline for the comparison. 
An analysis of communication costs for the two algorithms 
executing on PASM has been given in [Siegel81]. The results of an 
analysis of the execution of the two algorithms in a circuit 
switching formulation based on TRAC are given here. Space 
limitations preclude detailing of the analysis. 


Communication Analysis for Parallel Algorithms 


The major factors determining communication cost for the 
execution of parallel algorithms on interconnection network (ICN) 
based multiprocessors include: (i) the topology of the ICN and the 
configuration of resources on the ICN, (ii) the mapping of the data 
movement requirements of the algorithm upon the ICN, (iii) choice 
of switching methodology, (iv) the latency and bandwidth 
properties of the ICN, and (v) the unit sizes and the total volume 
of the data to be moved. This paper focuses on the impact of 
switching methodology and data volume on communication cost. 


The choice of packet switching or circuit switching as the mode of
network data path establishment can have a substantial effect on
each of these architectural parameters. Packet switching tends to
give flexibility in topology but fixed unit transfer sizes. Circuit
switching tends toward less flexibility in topology, greater flexibility
in unit size for transfers, but a longer transfer latency time. Packet
switching may also introduce bandwidth degradation due to path
contention while circuit switching may introduce path blockages
which limit realizable network topologies for all networks short of
full cross-bars.

The measure of communication cost is elapsed time. The
communication times given herein are reported as number of
memory cycles. We assume, in order to normalize computation
costs across architectures, that an integer addition takes a single
memory cycle and that updating a histogram vector element
requires two integer additions. The speed-up of a multiprocessor
over a uniprocessor is the ratio of total execution times, T_T, where
T_T = T_COMM + T_COMP. All LOG's in this paper are in base 2
unless otherwise noted. The data paths of each ICN are taken to
be one integer word in width. For the multistage ICN's of PASM
and TRAC it is assumed that a unit of data moves through one
stage of the ICN on each memory cycle.


Definition of Architectures 


Communication cost for execution of the two algorithms is 
compared for three ICN-based multiprocessor architectures. The 
single shared bus architecture (Figure 1) has been characterized by 
Bhuyan and Agrawal [Bhuyan82]. It is a baseline for ICN-based 
multiprocessors. There is no distinction between packet and circuit 
switching in this model of communication. The model for a packet 
switching data movement architecture is PASM [Siegel81]. The 
ICN of PASM connects complete processing elements as shown in 
Figure 2. 


Fig. 2. PE-to-PE Configuration


Fig. 3. Processor-to-Memory Configuration


The interconnection networks proposed for PASM are the 
generalized cube and the augmented data manipulator (ADM) 
[Siegel79]. These two networks are optimal for histogramming in
the sense that all permutations for the algorithm can be realized by 
both networks in a single pass. Thus packet transfers can take 
place without blocking. 


The model for a circuit switching data movement architecture is
TRAC [Sejnowski80]. TRAC places processors at the apex nodes
and memories at the base nodes of its ICN (Figure 3). The ICN of
TRAC is an SW-Banyan [Premkumar80] with nodes having a spread
of two and a fanout of three. Processor configurations
are formed by establishing circuits in the ICN joining processors to 
memory units. Data flow between processors for different stages of 
the algorithms can be realized by dynamically switching memories 
between processor-memory configurations. This network also 
implements trees of circuits joining one memory to many 
processors in which any one circuit may be activated and/or 
deactivated by a single processor instruction. These tree circuits 
are called shared or switchable memory trees. Data flow between 
processors may be implemented using this capability by a sequence 
of circuit activations and deactivations (among the circuits 
following the tree). 


The ICN of TRAC actually implements both circuit switching
and packet switching, but only the circuit switching mode of use is
modeled in the equations that follow.


The Algorithms and Their Mapping to the Architecture 


Histogramming and smoothing are among the basic operations of
image processing, although not usually rate determining steps in
the computations. Attention to detailed parallel formulations of
major computational steps of image processing such as thresholding
and edge detection is needed. It is assumed in the analysis
following that the picture is M*M pixels in size (M=2^m) and that
N (N=2^n) processors are available. The resolution of each pixel is
λ bits.


Histogramming Algorithm 


The parallel algorithm for histogramming is the recursive
doubling algorithm of Stone [Stone75]. The structure of the
algorithm is shown in Figure 4 for N=8. N partial histograms are
computed in parallel at level 0. Each partial histogram is a vector
of length 2^λ. The partial histograms are then added in pairs in
parallel for LOG(N) stages to complete the algorithm.


Figure 4: Recursive Doubling Algorithm
for Histogramming


Partial histograms are shown at level 0 by A's and vector additions
by B's. N/2^i transfers of vectors are done between level (i-1) and
level i. The computation time, T_COMP, for this algorithm under
the assumptions made here is proportional to T_COMP = M^2/N +
2^λ LOG(N).
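The structure of the recursive doubling computation can be sketched in a few lines of sequential code. This is an illustration only, not the PASM or TRAC implementation: the processors are simulated by list slices, adjacent partial histograms are paired at each stage (an assumption), and all names are hypothetical.

```python
def partial_histogram(pixels, bins):
    """Histogram one processor's share of the pixels."""
    hist = [0] * bins
    for value in pixels:
        hist[value] += 1
    return hist

def recursive_doubling_histogram(pixels, n_procs, bits):
    """Histogram `pixels` with n_procs simulated processors.

    Assumes n_procs is a power of two that divides len(pixels), and that
    every pixel value lies in range(2 ** bits).
    Returns (histogram, number of pairwise-addition stages).
    """
    bins = 2 ** bits
    chunk = len(pixels) // n_procs
    # Level 0: each processor histograms its own share of the pixels.
    hists = [partial_histogram(pixels[p * chunk:(p + 1) * chunk], bins)
             for p in range(n_procs)]
    stages = 0
    # Levels 1..LOG(N): add the partial histograms in pairs.
    while len(hists) > 1:
        hists = [[a + b for a, b in zip(hists[i], hists[i + 1])]
                 for i in range(0, len(hists), 2)]
        stages += 1
    return hists[0], stages

pixels = [i % 4 for i in range(64)]          # 64 pixels, 2-bit intensities
print(recursive_doubling_histogram(pixels, n_procs=8, bits=2))
```

With 8 simulated processors the pairwise additions take LOG(8) = 3 stages, and each stage halves the number of partial histograms.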


A Packet Switching Formulation of Recursive Doubling
Histogramming - Siegel et al [Siegel81] have given a thorough
analysis for the execution of this algorithm on PASM. We adopt
the results of this study as our packet switching model of recursive
doubling histogramming. It is commonly the case that further
steps in the analysis of the image require thresholding so that the
final histogram vector must be collected in one processor and the
threshold value distributed. The total communication time for this
formulation of the algorithm is

    T_COMM = [(LOG(N) + 2^λ) + 2] x LOG(N),

where LOG(N) + 2^λ is the travel time through the ICN, 2 is the
switch setting time, and the factor LOG(N) is the number of levels
in the ICN.

A Circuit Switching Formulation of Recursive Doubling
Histogramming Based on Tree Circuits - Figure 5 illustrates the
structure of the circuit switched data movement formulation of
recursive doubling for an 8 processor-8 memory configuration.


Figure 5: Circuit Switching Using the
Tree Circuit Formulation
(legend: tree circuits, each circuit having a distinct "color";
normal processor-memory circuits)


The M^2 pixels are evenly partitioned among the 8 memories. Each
processor computes a partial histogram vector and stores it in the
corresponding memory. The computation is then completed in
LOG(8)=3 stages of adding partial vectors with the full histogram
computed by processor 3 and stored in memory 5. The tree
circuits of Figure 5 implement the successive communication paths
between levels in Figure 4. The "-----" tree circuits implement the
data flow between levels 0 and 1 in Figure 4, "ooooo" the data flow
between levels 1 and 2 and "....." the data flow between levels 2
and 3. There is a regular pattern of using first the verticals and
then the diagonals of each type of tree circuit. Each tree circuit
type has a unique number (called COLOR in correspondence with
graph theory). LOG(N) colors are required to implement the
algorithm in this formulation. Path selection (activation and
deactivation) in all tree circuits of identical COLOR can be done in
parallel with a latency time proportional to LOG(N)/2. The ICN
of TRAC can implement the tree circuit pattern of Figure 5
without blockage. The total communication cost for this
formulation of recursive doubling histogramming is


    T_COMM = [LOG(N)]^{-1} \sum_{i=1}^{LOG(N)} (N/2^i) [LOG(N) + LOG(N)/2]
           = (3/2)(N-1),

where, inside the final brackets, LOG(N) is the time to switch all
memory with identical COLOR and LOG(N)/2 is the latency time.
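
The closed form (3/2)(N-1) follows from a geometric sum. The short derivation below assumes that the latency term is LOG(N)/2, as stated in the text, and that the sum runs over the LOG(N) levels of the algorithm; it is offered as a check rather than as the authors' own derivation.

```latex
\begin{align*}
T_{COMM} &= \frac{1}{\mathrm{LOG}(N)} \sum_{i=1}^{\mathrm{LOG}(N)} \frac{N}{2^{i}}
            \left[\mathrm{LOG}(N) + \frac{\mathrm{LOG}(N)}{2}\right] \\
         &= \frac{3}{2} \sum_{i=1}^{\mathrm{LOG}(N)} \frac{N}{2^{i}}
          = \frac{3}{2}\, N \left(1 - \frac{1}{N}\right)
          = \frac{3}{2}\,(N-1).
\end{align*}
```

The result does not depend on the pixel resolution, so this cost stays fixed as the histogram vectors grow.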


A Circuit Switching Formulation of Recursive Doubling Based on 
Direct Reconfiguration - Another formulation based on circuit 
switching can be developed by directly reconfiguring the ICN after 
each step (level in Figure 4) of the algorithm to conform to the 
data movement path required at each stage of the algorithm. Each 
configuration step involving establishment of a circuit between a 
given processor and a set of memories must be done serially. Thus
use of the tree-circuit based algorithm is faster by a factor of 
LOG(K) where K is the number of COLORs available. 


The Smoothing Algorithm


Smoothing is replacement of the intensity of each pixel by the 
mean of the intensity of the given pixel and its nearest neighbors. 


Packet Switching Formulation of Smoothing - Siegel et al
[Siegel81] have formulated and analyzed a packet data movement
formulation of the smoothing algorithm. They show a speed-up of
about 0.8N for a 1024 processor configuration. This estimate is
extremely conservative; a greater speed-up is probable.


Circuit Switching Formulations of Smoothing - A circuit
switching structure for the smoothing operation is suggested by the
fact that each computation requires only nearest neighbors.
Therefore if the pixels are stored by columns then a processor will
need simultaneous access to three columns (say k-1, k, k+1) to
execute the computations on column k. A realization of this
representation of the smoothing algorithm is given in Figure 6.
Extra zero valued rows and columns of pixel values are added to
ease formulation of boundary conditions. The solid lines of Figure
6 are normal circuits. The dotted lines are tree circuits from which
leaf-root paths can be selected. Processor 1 computes in sequence
the smoothed values for the pixels in columns 1, 2 and 3. Processor
2 will simultaneously and in sequence compute the smoothed values
for the pixels in columns 4, 5 and 6. P1 and P2 must share access
to pixel columns 3 and 4. The execution procedure described
above allows this sharing to be accomplished without conflict if
the required circuits can be established in the network. This two
processor configuration obviously extends to an N processor,
3N-memory configuration so long as the memory unit can hold an
entire column of pixel values. A TRAC-like ICN can realize these
configurations so long as these restrictions are met. It is also the
case that the necessary data movement can be realized by
reconfiguration of normal circuits. This is not the method of choice
so long as the conditions for a tree circuit representation can be
met.
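
The column-wise data dependence of smoothing can be seen in a small sequential sketch. Two assumptions are made here that the text does not fix: the "nearest neighbors" are taken to be the eight pixels of the surrounding 3x3 window, and the mean is always taken over nine values, including the zero-valued border pixels. The function name is hypothetical.

```python
def smooth(image):
    """Replace each pixel by the mean over its zero-padded 3x3 window."""
    rows, cols = len(image), len(image[0])
    # Surround the image with a border of zero-valued pixels.
    padded = [[0] * (cols + 2)]
    padded += [[0] + list(row) + [0] for row in image]
    padded += [[0] * (cols + 2)]
    out = [[0.0] * cols for _ in range(rows)]
    for k in range(cols):            # column k uses only columns k-1, k, k+1
        for r in range(rows):
            # Pixel (r, k) sits at padded[r + 1][k + 1], the window centre.
            window = [padded[r + dr][k + dc]
                      for dr in range(3) for dc in range(3)]
            out[r][k] = sum(window) / 9
    return out

image = [[9, 9, 9], [9, 9, 9], [9, 9, 9]]
for row in smooth(image):
    print(row)
```

Because the outer loop touches only three adjacent columns at a time, each column of the result can be assigned to a processor that has access to those three columns.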
Figure 6: A Storage Structure and
Circuit Configuration for
Parallel Smoothing


It may be desired to use a degree of parallelism greater than M
(N>M). Then the columns of pixels must be decomposed into
vectors of length M/k where N=kM. In this case the
establishment of circuits is dependent upon k and may not always
be possible. A formulation using both circuit switching and packet
routing capabilities for TRAC has been worked out. The pixels
appearing at the boundaries created by partitioning of columns
have their "nearest neighbors" sent to them by packet movement.
This "mixed" communication mechanism is still of lower cost than
a pure packet based mechanism. The case N<M (for N=2^i,
M=2^j) is handled by assigning multiple (2^(j-i)) columns to each
processor. This case raises no new problems.


We give here numbers only for the circuit switching
representation where N=M and data movement is via tree circuit
activation and deactivation. Then the total communication cost is
(N/2) LOG(N) (assuming deactivation and activation of all tree
circuit paths is done in parallel). If N=M=512, then only
256*9=2304 memory cycles are required for data communication.
This is negligible compared to the C*512*512 arithmetic operations
on the pixels (C>10 and probably C>10^2) since indexing must be
accomplished as well as the addition and division of smoothing
itself.


We thus conclude that for smoothing data movement costs will 
be essentially trivial for both packet and circuit switching 
representations. 


Speed-up Analysis and Discussion 


Figure 7 shows the net speed-up versus the number of processors
for M=1024 and λ=8. There is, in this case, little difference
between formulations based on different switching strategies for 
moderate numbers of processors. There is the suggestion that 
circuit switching will yield superior performance for large numbers 
of processors. 


Figure 8 shows the speed-up factor as a function of λ for
M=1024 and N=256. The amount of data transferred grows
exponentially with λ. Thus circuit switching data movement shows a
strong advantage as λ increases since the cost of data movement in
the circuit switching strategy given here is constant with respect to
data volume until the capacity of a memory unit is exceeded.
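
The trend described here can be explored with the cost expressions of the histogramming analysis. The sketch below rests on several assumptions and does not claim to reproduce the plotted curves: all constants of proportionality are set to 1, the packet cost is read as [(LOG(N) + 2^λ) + 2] x LOG(N), the circuit cost as (3/2)(N-1), and the uniprocessor time is taken to be the computation time at N=1. The function names and the parameter name `lam` (pixel resolution in bits) are hypothetical.

```python
from math import log2

def t_comp(M, N, lam):
    """Computation time M^2/N + 2^lam * LOG(N), in memory cycles."""
    return M * M / N + (2 ** lam) * log2(N)

def t_comm_packet(N, lam):
    """Packet switching: [(LOG(N) + 2^lam) + 2] * LOG(N)."""
    return ((log2(N) + 2 ** lam) + 2) * log2(N)

def t_comm_circuit(N):
    """Circuit switching with tree circuits: (3/2)(N - 1)."""
    return 1.5 * (N - 1)

def speedup(M, N, lam, mode):
    """Ratio of uniprocessor time to multiprocessor time T_COMM + T_COMP."""
    comm = t_comm_packet(N, lam) if mode == "packet" else t_comm_circuit(N)
    uniprocessor = t_comp(M, 1, lam)   # no communication when N = 1
    return uniprocessor / (t_comp(M, N, lam) + comm)

for lam in (4, 8, 12):
    print(lam,
          round(speedup(1024, 256, lam, "packet"), 1),
          round(speedup(1024, 256, lam, "circuit"), 1))
```

Under these assumptions the packet cost grows with 2^λ while the circuit cost does not, so the gap between the two speed-ups widens as λ increases.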


Smoothing on the other hand shows advantage for packet 
switching since there are cases where a pure circuit switching 
formulation becomes rather complex. 


The bottom line with respect to parallel histogramming is that
circuit switching has an advantage resulting from flexibility in the 
unit size of transfers and in stability with respect to algorithm 
parameters but that well-designed architectures should give similar 
performance for small to moderate numbers of processors. 


Circuit switching and packet switching are both extremely 
efficient for parallel smoothing. Packet switching has an advantage 
over circuit switching with respect to application of degrees of 
parallelism with N>M for parallel smoothing. This advantage 
arises from the greater flexibility in communication topology.


Figure 7: Speedups versus the Number of Processors
(M=1024, λ=8)


There is a suggestion from these two algorithms that
implementation of both packet and circuit switching facilities in the
ICN's for multiprocessors will give lower communication cost and
greater net speedup than either used separately.


Abstract -- The use of the Goodyear Massively Parallel
Processor (MPP), an array of 16384 Processing Elements, is 
described for the solution of the shallow-water equations in a 
spherical geometry. These partial differential equations arise 
in Numerical Weather Prediction models and their fast 
solution is necessary. These are discretized with second order 
finite-differences on a latitude-longitude grid. Each physical 
grid point is mapped onto one MPP Processing Element. A 
set of difference equations results at each grid point, the 
same set at each grid point. This makes possible the use of 
a parallel algorithm for their solution at all grid points 
simultaneously. Only values from neighbourhood points are 
needed except for a few cases and thus routings between 
non adjacent Processing Elements are kept at a minimum. 
The resolution achieved with both available horizontal MPP 
dimensions is adequate and is comparable with fine 
resolution models currently in use. The exploitation of the 
MPP architecture is described and some of the problems 
facing the algorithm designer when confronting this novel 
computer architecture together with suggestions for 
handling them are indicated. Performance comparison 
estimates indicate that the MPP could achieve equal or 
better performance than more expensive supercomputers for 
such a problem. It is concluded that the MPP can 
competitively solve problems in the area of Numerical 
Weather Prediction. 


Introduction 


The Massively Parallel Processor system (MPP) [1] is a 
bit-serial SIMD computer designed and built as a 
collaborative effort between the NASA Goddard Space 
Flight Center and the Goodyear Aerospace Corporation, 
primarily, to support high-speed Image Processing. We 
intend to show here by means of a complete example how 
the MPP can be used very effectively to solve problems in 
computational physics, in particular the shallow-water 
equations occurring in numerical weather prediction. For a
system like the MPP which was designed primarily for one 
particular area, namely Image Processing, we indicate how 
the available massive parallelism, if used carefully, can give 
excellent performance levels, comparable to, and for the 
example better than, current supercomputers. This 
application study shows that for some problems, the 
dimensionality constraints imposed by such an architecture 
are not a problem. The efficient use of the MPP in even a 
small set of real problems in this area would help the 
modellers in need of new and faster computational tools. 
This is the first such study done for the MPP. It is based on
a parallel algorithm first set out by Kalnay and Takacs in
[2]. It is interesting to note, as mentioned in [2], that the
father of Numerical Weather Prediction, L. F. Richardson, in
his pioneering work [3] had envisaged a "human parallel
computer" for performing weather prediction. More
recently, references [4, 5, 6, 7] contain studies on the use of
unconventional architectures to solve problems occurring in
that area.


(Research supported in part by NASA under contract
NAS5-26405)


MPP Description 


The Goodyear Massively Parallel Processor (MPP) is a
bit-serial Single-Instruction Multiple-Data computer
currently being built under NASA contract to support
high-speed Image Processing. It consists of four main
components [fig. 1]:

Array Unit (ARU)

Array Control Unit (ACU)

Program and Data Management Unit (PDMU)

Staging Memory (SM)


The ARU consists of a 128×128 array of bit-serial
Processing Elements (PE) [fig. 2], each having a 1K bit
RAM, extensible to 64K, giving an overall 2 Mbyte ARU
memory capacity. The interconnection network is of
nearest-neighbour type with possible open, cylindrical,
toroidal and spiral connections for the edges, all under user
control. The ACU cycle time is 100 ns. The ACU [fig. 3]
executes the application programs, scalar and array
operations, and manages the I/O. It consists of the PE
Control Unit (PCU), the I/O Control Unit (IOCU) and the
Main Control Unit (MCU). These three modules operate in
parallel and thus array and scalar arithmetic and I/O can be
overlapped. Scalar data and application programs reside in
MCU memory. The PCU memory contains the routines
that operate on arrays of data in the ARU. Each PCU
instruction is 64 bits wide, allowing several elementary PE
bit operations to be performed simultaneously. PCU routines
are called from application programs residing in the MCU
memory. A call queue is provided to queue up the calls from
the MCU to the PCU, enabling the MCU to work
concurrently. Since most of the MCU operations consist of
subroutine calls or scalar operations that take much less
time than the array operations, the PCU rarely waits for a
new call to be issued by the MCU.


The S-registers on each PE can shift planes of data without
interfering with the computations except when a bit-plane is
to be written into or read from ARU memory. Hence only 1
cycle for reading or writing is stolen every 128 PE activity
cycles (the time needed to bring a new 128×128 bit-plane
into place in the S-registers). The Staging Memory buffers
data between the ARU and the secondary storage devices.


The Program Development Software consists of two
assemblers, one each for the PCU and MCU, a System
Subroutine Library, a set of I/O Macros initiating I/O
functions, a Control and Debug module and a Linker.
Additionally a parallel version of Pascal, Parallel Pascal [8],
will be available.


If assembly language is to be used in order to make the
most efficient use of the bit-serial features of the MPP, then
to make the task of large-scale programming feasible, a large
number of utility routines must be available. These routines
can be roughly classified as

1) array unit arithmetic (signed and unsigned integer,
floating point),

2) scalar-array arithmetic (e.g. scalar by array
multiplication),

3) scalar arithmetic,

4) array logical operations (all Boolean functions) and
comparisons,

5) routing operations,

6) search operations,

7) reduction operations,

8) others.


This is only a rough classification. Nevertheless it gives a 
flavour of the type of utility routines that should be 
available to the MPP programmer. At a somewhat higher 
level, the standard mathematical functions (as in Fortran) 
operating on array arguments, together with matrix and 
vector manipulators, must all be available in the utility 
library. The existence of efficient routines at that level is 
imperative for maximizing the performance. 


At the time of the experiment (Summer ’82), the MPP 
was still under development. Therefore the experiments 
were done on an MPP simulator system [9]. This system 
coordinates the execution of programs to emulate the 
actions of the MPP system. The user can write his 
programs in the MCU and PCU assembly languages using 
any of the available library routines. The resulting modules 
are loaded in the system and then the program trace can be 
followed through the available Debugger. As such it
provides an excellent development tool for routines
developed for the MPP. Even when the MPP is available, it
would be more convenient to use the simulator first for code 
debugging and testing. The simulator system records the 
number of MCU and PCU cycles used during the execution 
of the application programs. It is written in the C 
programming language and runs under the Unix [10] or 
VMS [11] operating systems. 


Problem description 


The problem under consideration is the solution of a
simplified form of the Navier-Stokes equations suitable for
weather prediction in meteorology. Our example here could
well serve as a guide for modellers in need of faster
computation rates in other areas of fluid mechanics.


The physical processes occurring in the atmosphere, and
as a result their mathematical formulations, are non-linear.
Even with the occasional simplifications, the arising
equations cannot be solved analytically and require good
numerical techniques.


To decide on the suitable simplifications to make to the
full set of equations describing the important physical
phenomena in the atmosphere one has to note that although
what principally determines the long-term statistical
properties of the atmosphere are the cumulative effects of
heating and friction, these terms are locally small compared
with the fluid-dynamical terms. Hence by concentrating on
these latter equations and terms the modeller can draw
important conclusions for the behaviour of the complete
system. The model to be solved concentrates on these terms
as a first step towards a complete model, similar to the one
described in [12]. It has been found that the so-called
shallow-water or barotropic equations contain the essential
numerical aspects of the large scale prediction equations [13]
governing an incompressible, homogeneous and hydrostatic
fluid. These are strictly two-dimensional and thus refer to
phenomena at a single fluid layer. The spherical polar
coordinates $(\lambda, \phi)$, where $\lambda$ is the longitude and $\phi$ the
latitude, are chosen as the most natural reference system for
motions around the globe. At time $t$ and at position $(\lambda, \phi)$ the
dependent variables for the model are the height $h$ of the
free atmospheric surface under consideration, equal to
$h_T - h_B$, and the Eastward and Northward velocity
components $u$ and $v$ respectively. The equations, broken
down to $\lambda$ and $\phi$ components, are written as follows:

where $f = 2\Omega\sin\phi$ (with $\Omega$ the angular rotation frequency,
equal to $7.292 \times 10^{-5}$ sec$^{-1}$) is the Coriolis parameter. The first
equation comes from the law of mass conservation and the
last two from the law of momentum conservation. They can
be found in this form in [14]. For a model where bottom
orography is not included, $h_B = H$, a constant, and thus the
$h_T$ variable above can be replaced by $h$. This is the
quasi-linear hyperbolic system that must be solved, given suitable
boundary and initial conditions. The equations are written
in flux form, using the time derivatives of the height-velocity
products, as this gives better results when discretized. The
characteristic speeds for the above system correspond to i) a
pair of fast moving gravity waves having phase speeds of
order of magnitude $\sqrt{gH}$ and to ii) a slow westward moving
Rossby wave which is of most importance for large scale
meteorological processes. The disparity of speeds between
these modes creates important constraints for the choice of
the integration time step.
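
The two quantities named in this paragraph can be evaluated directly. The following is a minimal Python sketch, not part of the original study: the value of g and the mean depth H = 8000 m are assumptions chosen only for illustration, and the function names are hypothetical.

```python
import math

OMEGA = 7.292e-5   # angular rotation frequency in 1/s, as quoted in the text
G = 9.81           # gravitational acceleration in m/s^2 (assumed value)

def coriolis(lat_deg):
    """Coriolis parameter f = 2 * Omega * sin(phi) for a latitude in degrees."""
    return 2.0 * OMEGA * math.sin(math.radians(lat_deg))

def gravity_wave_speed(mean_depth_m):
    """Order-of-magnitude phase speed sqrt(g * H) of the fast gravity waves."""
    return math.sqrt(G * mean_depth_m)

if __name__ == "__main__":
    for lat in (0.0, 45.0, 90.0):
        print(f"f at {lat:4.1f} deg: {coriolis(lat):.3e} 1/s")
    # H = 8000 m is a hypothetical mean depth, not a value from the paper.
    print(f"sqrt(g*H) for H = 8000 m: {gravity_wave_speed(8000.0):.0f} m/s")
```

With these assumed values the gravity waves travel at a few hundred metres per second, which is why they, rather than the slow Rossby wave, limit the explicit time step.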


Computational Grid and Discretization 


A latitude-longitude grid with constant angular
increments $\Delta\lambda$ and $\Delta\phi$ has been used. The grid is
non-staggered (all variables are defined at each node). Each node
lies at the intersection of selected latitude and longitude
circles. In contrast to the GLAS model [12] and as in [14]
the grid system is shifted by $\Delta\phi/2$ next to the poles. Hence
the polar singularity which arises from the $a\cos\phi$ term in the
equations above is avoided. For $m$ longitude and $n$ latitude
circles

$$ \Delta\lambda = \frac{2\pi}{m}, \qquad \Delta\phi = \frac{\pi}{n}, $$

the first and last latitude circles, lying next to the north and
south poles respectively, correspond to
$\phi = \pm\left(\frac{\pi}{2} - \frac{\Delta\phi}{2}\right)$.
Boundary conditions are doubly periodic as with the GLAS
model. In the East-West direction for both scalar and vector
elements $s(\lambda + 2\pi, \phi) = s(\lambda, \phi)$. In the North-South direction,
when variables are needed across the poles, the values are
taken from the first grid point encountered by moving along
the same latitude circle 180° around the pole. To keep the
equations consistent the signs of the vector variables and the
trigonometric functions must be changed when combined
across the poles [15]. Hence

$$ v\left(\lambda, \pm\left(\tfrac{\pi}{2} + \tfrac{\Delta\phi}{2}\right)\right) = -v\left(\lambda + \pi, \pm\left(\tfrac{\pi}{2} - \tfrac{\Delta\phi}{2}\right)\right) $$

and

$$ h\left(\lambda, \pm\left(\tfrac{\pi}{2} + \tfrac{\Delta\phi}{2}\right)\right) = h\left(\lambda + \pi, \pm\left(\tfrac{\pi}{2} - \tfrac{\Delta\phi}{2}\right)\right). $$

For the solution of the system, a set of initial conditions
for h, u, v at all locations must be available. For model
testing, the conditions used were derived analytically as in
[16], instead of extracting them from a weather map.
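
The cross-polar boundary rule described above can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the MPP code: the field is a plain list of rows with row 0 next to the north pole, the number of columns is assumed even, and the function name is hypothetical.

```python
def ghost_across_north_pole(field, col, is_vector):
    """Value seen just across the north pole from column `col` of row 0.

    The value is taken from the same latitude circle (row 0), half the
    columns away (180 degrees of longitude). Vector components change sign;
    scalars such as the height h do not.
    """
    m = len(field[0])                      # number of columns, assumed even
    value = field[0][(col + m // 2) % m]
    return -value if is_vector else value

if __name__ == "__main__":
    h = [[10.0, 20.0, 30.0, 40.0]]         # scalar field, one row for brevity
    v = [[1.0, 2.0, 3.0, 4.0]]             # northward velocity component
    print(ghost_across_north_pole(h, 0, is_vector=False))
    print(ghost_across_north_pole(v, 1, is_vector=True))
```

The same half-row shift is what the ARU implementation obtains by a cylindrical rotation of the arrays along the East-West interconnection.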

The objective is to write finite-difference expressions
and discretize the space derivatives ($\partial/\partial\lambda$, $\partial/\partial\phi$) on the
right-hand side of the equations (1). Then a time
differencing scheme can be applied to integrate one step
ahead in time and update the variables. The standard
notation for the average and difference operators will be
used: if a function $s$ is defined on a grid having grid
increment $\Delta x$ then the difference operator acting on $s$ would
be defined by

where $s_i$ means the value of the variable $s$ evaluated at
point $i\Delta x$. As a result, a second order approximation to
$\partial s/\partial x$ at point $i\Delta x$ is given by

Extensions to more dimensions are immediate.


By using the aforementioned operators, the right hand
sides of (1) could be discretized suitably and consistently as
follows:

where each variable and constant is evaluated at each grid
point. To advance forward in time an explicit leapfrog
scheme is used, giving that

$$ s^{n+1} = s^{n-1} + 2\Delta t \left(\frac{\partial s}{\partial t}\right)^{n} \qquad (4) $$

where for notational simplicity $s$ is the vector of unknowns
$[h, hu, hv]^T$ defined at each grid point.
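
As an illustration of the difference and averaging operators used for the space derivatives, the sketch below assumes the usual half-point definitions (difference $(s_{i+1/2} - s_{i-1/2})/\Delta x$ and average $(s_{i+1/2} + s_{i-1/2})/2$) on a periodic one-dimensional grid. These definitions and the function names are assumptions made for the example; composing the two operators gives the familiar centered difference $(s_{i+1} - s_{i-1})/(2\Delta x)$.

```python
def delta(s, dx):
    """Difference operator; result[i] is located at the half point i + 1/2."""
    n = len(s)
    return [(s[(i + 1) % n] - s[i]) / dx for i in range(n)]

def average(s_half):
    """Averaging operator; combines half points i - 1/2 and i + 1/2 at node i."""
    n = len(s_half)
    return [0.5 * (s_half[i] + s_half[i - 1]) for i in range(n)]

def centered_derivative(s, dx):
    """Second order approximation to ds/dx at the nodes of a periodic grid."""
    return average(delta(s, dx))

if __name__ == "__main__":
    samples = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]   # s = x*x sampled with dx = 1
    print(centered_derivative(samples, 1.0))     # interior entries equal 2*x
```

Only the interior entries are meaningful for this non-periodic sample; on the sphere the East-West direction is genuinely periodic, so the wrap-around is exactly what is needed there.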


Such a time-differencing scheme is frequently used with
current models. The relevant theory [17] imposes an upper
bound on the time step that one could use to avoid the
phenomenon of linear instability. Roughly, the finer the
resolution, the smaller that bound. In [14] a stability
criterion is shown for the model. A latitude-longitude grid,
with its meridians converging at the poles, forces a much
smaller time step in the polar regions than at the
equator. As a result that minimum time step should be
used. The scheme is of second order in space and time. To
start the computations, data from two time levels are
needed since the chosen time-differencing scheme uses
information from the previous two time levels. This is
achieved by using a simple forward time scheme for a
fractional time step and then proceeding.
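
The start-up procedure and the leapfrog update can be sketched in a few lines of Python. This is a simplified illustration for a scalar unknown: it uses one full forward step to obtain the second time level, whereas the text uses a fractional forward step, and the function and parameter names are hypothetical.

```python
def leapfrog(tendency, s0, dt, nsteps):
    """Integrate ds/dt = tendency(s) over nsteps steps of size dt.

    A forward (Euler) step supplies the second time level; every later step
    uses s_next = s_prev + 2*dt*tendency(s_curr), i.e. the leapfrog update.
    """
    s_prev = s0
    s_curr = s0 + dt * tendency(s0)          # forward start-up step
    for _ in range(nsteps - 1):
        s_next = s_prev + 2.0 * dt * tendency(s_curr)
        s_prev, s_curr = s_curr, s_next
    return s_curr

if __name__ == "__main__":
    # A constant tendency is integrated exactly by both schemes.
    print(leapfrog(lambda s: 3.0, 1.0, 0.5, 4))
```

Each leapfrog step needs the tendency at the current level only, which is why the update costs a single addition per component once the scaled tendency is available.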


MPP Implementation 


As mentioned above the problem was implemented on
an MPP simulator. The simulator's overhead and the
sequential processing done by the host machine made
the simulation of a 128×128 ARU infeasible. Therefore a
32×32 ARU was simulated. It will be noted where that
difference would qualitatively alter the statements.
Although assembly language was used for the program, the
more elegant Parallel Pascal constructs will be used to
indicate some features of the code. Explanations are
provided for the parallel extensions whenever they are used.
The spherical grid is mapped directly onto the ARU: each PE
and its memory contain the variables and constants for
the corresponding node. The northmost (southmost) PE row
corresponds to the latitude circle immediately to the
south (north) of the North (South) pole.


The variable elements that are only functions of
position, like the Coriolis force component $f$, $\tan\phi$, $\cos\phi$ etc.,
can all be precalculated before the start of the integration
loop and are constant (with respect to time) arrays. The
time dependent variables h, u, v and their combinations are
also defined at each point. Hence the above values are
stored in each PE memory at fixed locations and their
corresponding type declaration in a high-level language
environment supporting parallel constructs (e.g. Parallel
Pascal) would be

    parallel array [1 .. N, 1 .. M] of real

where N (M) is the number of rows (columns) of the grid and
of the corresponding ARU, and where the parallel
declaration indicates that the array would be used heavily in
parallel operations, so that the compiler is informed of the
preference of the user to have the array stored in the ARU
memory for the sake of efficiency. The scalar data is either
global to the problem and used in the actual equations (2, 3)
or is used for counting purposes. To the former category
belong $g$, $\Delta t$, $\Delta\phi$, $\Delta\lambda$, $a$, etc. and to the latter all the
variables keeping track of the number of steps and
simulated time since the last important event
(reinitialization, filtering etc.). The limited available memory
per PE (only 1024 bits) forces the programmer to look for
ways to save as much space as possible by using the
Main Control Unit memory when appropriate. The availability
of scalar-array arithmetic routines avoids any timing penalty
for doing that. In order to avoid unnecessary repetition of
floating-point operations some constant arrays and scalars
can be combined at the start of the computations
with the appropriate scale factors (e.g. $\Delta t$, $\Delta\lambda$ etc.). Thus by
storing precomputed constant data in the ARU, the number
of required floating-point operations, which are expensive,
could be reduced. This is not a problem for the example. For a
more complicated model however, as the first configuration
of the ARU memory is limited, the SM must be used.

The available East-West ARU topology readily provides for the
East-West periodicity, the first and last PE columns
corresponding to $\lambda = 0$ and $\lambda = 2\pi - \Delta\lambda$ respectively. The
situation is slightly more complicated for the simulation of
the periodicity across the poles, as that cannot be mapped
to the available ARU topologies. To combine the
appropriate elements, the variable arrays would have to be
cylindrically rotated (using the same East-West
interconnection as above) by $N/2$ columns. For the N = 32
case, the cycle count for the routing operations is
underestimated relative to the N = 128 case. As will be seen
however, the operation is infrequent enough not to
significantly change the full timing estimates.


The corresponding angular increments for the
simulated and the real ARU are $\Delta\lambda = 2\Delta\phi \approx 11^\circ$ and
$\Delta\lambda = 2\Delta\phi \approx 3^\circ$ respectively. To compare, a currently used
GLAS model utilizes $\Delta\lambda = 5^\circ$ and $\Delta\phi = 4^\circ$ whereas an
'ultrafine' version has $\Delta\lambda = 3^\circ$ and $\Delta\phi = 2.5^\circ$. We thus see
that the MPP dimensions are adequate for the described
problem. Hence no 'dimensional sacrifice' need be made to
size down the problem for a parallel implementation, an
unfortunate practice in some examples demonstrating the
usefulness of parallel computers.


The operations applied at each grid point (PE) for the
derivation of the spatial differences are local operations. The
only place where this is not so is at some of the calculations
for the latitudinal differences. For the scheme used, which is
of 2nd order, only elements from the immediate neighbours
are needed. For a more accurate 4th order scheme, like the
one used in [12], the elements involved for a calculation at
PE(i,j) would lie in the surrounding PEs at distances of at
most 2.


The inner loop of the code would go through the
following steps in order to derive the new values at
$t = (n+1)\Delta t$ out of known data at the two previous time
steps:

1) From the available values for h, u, v at time level n
use the equations (2,3) above to approximate the
space derivatives and by adding the $\lambda$ and $\phi$
contributions at each grid point (PE) as in (1) derive an
approximation to $2\Delta t\,(\partial s/\partial t)$ for the current time
level.

2) Apply (4) to derive the values at the new time
level. This only needs one addition per component of
the s vector.

3) Update the variables and counters.


The PE memory restriction of 1K bits was observed: only
680 bit-planes have been used and these could be further
reduced to 600. The numerical stability
considerations mentioned above mean that in going from the
simulated array to the real ARU a drastic reduction of the time
step must be made in order to satisfy the CFL
criterion. We have used $\Delta t = 200$ sec for the case N = 32,
but a rough stability analysis gives $\Delta t \approx 20$ seconds
for N = 128. This is a serious constraint in going to the real
ARU. Moreover, the allowed time steps near the equator are
much larger than the ones allowed near the poles, due to the
convergence of the meridians at the poles. Methods that
could be used to overcome this are the Fourier filtering of
the unstable waves [18] or the use of an implicit time
differencing scheme [19] with no stability restrictions.
Alternatively, since the time constraint is relaxed by going
to a coarser grid, the number of latitude circles can be
reduced (e.g. halved). Since the North-South interconnections
are not used, two independent simulations could be run
concurrently (using the same global constants but different
initial conditions). One system would lie in the north
half of the ARU (using half of the PE rows) and the other
would lie in the south half. At the end of each computation
cycle results for both systems would be simultaneously
obtained.
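
The way the time step bound shrinks with resolution can be made concrete with a rough estimate. The Python sketch below is not the stability analysis of [14]; it simply divides the smallest East-West grid distance (on the latitude circle next to the pole) by an assumed gravity-wave speed. The Earth radius, g, and the mean depth H = 8000 m are assumptions, and constants of order one are ignored.

```python
import math

EARTH_RADIUS = 6.371e6   # metres (assumed value)
G = 9.81                 # m/s^2 (assumed value)

def min_zonal_spacing(n):
    """Smallest East-West grid distance on an n x n grid shifted by dphi/2."""
    dlam = 2.0 * math.pi / n
    dphi = math.pi / n
    lat_near_pole = math.pi / 2.0 - dphi / 2.0
    return EARTH_RADIUS * math.cos(lat_near_pole) * dlam

def rough_time_step_bound(n, mean_depth_m):
    """Crude CFL-type bound dt <= dx_min / sqrt(g * H)."""
    return min_zonal_spacing(n) / math.sqrt(G * mean_depth_m)

if __name__ == "__main__":
    for n in (32, 128):
        # H = 8000 m is a hypothetical depth used only for illustration.
        print(n, round(rough_time_step_bound(n, 8000.0), 1), "s")
```

Because both the longitude increment and the distance of the first latitude circle from the pole shrink with N, the smallest grid distance scales like 1/N squared, so a four times finer grid tightens this crude bound by a factor of about sixteen. That is the same order of magnitude as the reduction from 200 sec to about 20 sec quoted in the text.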


In order to avoid the phenomenon of non-linear
instability [20], a filtering procedure [21] was used. Such a
filter applied in the East-West direction to an array variable
$q$ at location $(i, j)$ repeatedly calculates

$$ \frac{q_{i,j+1} - 2q_{i,j} + q_{i,j-1}}{4} $$

and then combines the result with the original array values
$q_{i,j}$. As explained above, things are slightly more
complicated in the North-South direction. This filtering
results in the elimination of $2\Delta x$ waves, where $\Delta x$ is the
East-West grid distance, which may introduce the instability.
This procedure was applied every hour on h, u, v, eight
times in each direction, resulting in a 16th order filter. The
MPP implementation of the above operator is very efficient,
taking full advantage of the parallelism.

In a Parallel Pascal environment:

    const
        order = 8;
    type
        arr = parallel array [1 .. N, 1 .. M] of real;
    var
        q, qsv, temp : arr;
        i : integer;
    begin
        qsv := q;
        for i := 1 to order do
        begin
            temp := q - rotate(q, 0, 1);
            q := rotate(temp, 0, -1) - temp;
            q := q/2.;
            q := q/2.
        end;
        q := qsv - q
    end;

The rotate(a, i, j) function returns an array consisting of
the elements of a circularly rotated by distances i, j along
the two dimensions. The repeated divisions by 2 are done
by special purpose routines. The bit-serial nature of the
arithmetic makes divisions and multiplications by 2 and its
powers considerably faster than the general floating-point
division and multiplication routines.
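
For readers without a Parallel Pascal environment, the same filter can be sketched in Python for a single periodic East-West row. This is an illustrative translation, not the MPP routine: the direction convention chosen for rotate is an assumption (the result does not depend on it because the operator is symmetric), and the function names are hypothetical.

```python
def rotate(row, shift):
    """Circular rotation of a list: result[j] = row[(j + shift) % len(row)]."""
    n = len(row)
    return [row[(j + shift) % n] for j in range(n)]

def shapiro_filter_row(q, order=8):
    """Return q - (second_difference / 4)^order applied to q, for one row."""
    saved = list(q)
    work = list(q)
    for _ in range(order):
        # temp[j] = work[j] - work[j+1]
        temp = [a - b for a, b in zip(work, rotate(work, 1))]
        # work[j] = work[j-1] - 2*work[j] + work[j+1]
        work = [a - b for a, b in zip(rotate(temp, -1), temp)]
        # two successive halvings, as in the Parallel Pascal fragment
        work = [0.25 * w for w in work]
    return [s - w for s, w in zip(saved, work)]

if __name__ == "__main__":
    two_dx_wave = [1.0, -1.0] * 4
    print(shapiro_filter_row(two_dx_wave))   # the 2*dx wave is removed
    print(shapiro_filter_row([5.0] * 8))     # a constant row is unchanged
```

The two test rows show the intended behaviour: the shortest resolvable wave is eliminated completely, while a smooth (here constant) field passes through untouched.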


Description of results 


The resulting simulation corresponded to east to west 
moving wave patterns for h, u, v. At selected time intervals 
the contents of the corresponding ARU memory locations 
were examined. A software interface was set between the 


simulator and a colour graphics display terminal and the 
real valued array variables were scaled and mapped on a 
gray scale. The resulting integer valued arrays produced an 
image of the moving wave. On the real ARU this 
correspondence can be handled very quickly. The scaling 
and integer transformations occur at each grid point and full 
advantage of the parallelism is taken. Thus even for this 
output oriented consideration the MPP can be used very 
efficiently. Because of the extraordinary software complexity 
of the simulator and the slowness of the host machine, the 
simulation was only for a few hours. 


MPP Timings and Comparisons 


The number of consumed cycles for each routine and 
its function are given in the table. The number of PCU 
cycles is more important as the MCU works ahead and in 
parallel and takes less time. 


The number of cycles given for each routine includes 
the cycles spent by any called subroutines during the 
routine’s execution. As can be seen from the timing table, a 
main step, consisting of the space derivative calculations 
dtcalc (by far the most time consuming routine amongst the 
frequently called ones) and of update takes about 43,000 PE 
cycles, or for 100 ns/cycle there are about 230 iterations/sec. 
With a 200sec time step and excluding any overhead, we get 
that about 1 day is simulated in 2 seconds. This of course is 
a very rough estimate. If filtering is needed, once per hour 
say, the time estimate above does not significantly increase 
despite the cost of filter. 
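
The arithmetic behind these estimates is short enough to check by hand or with the following Python sketch. It uses only numbers quoted in the text (about 43,000 PE cycles per main step, a 100 ns cycle, a 200 sec time step); the function names are hypothetical.

```python
CYCLE_TIME_S = 100e-9        # 100 ns cycle time quoted in the text

def seconds_per_step(pe_cycles):
    """Wall-clock time of one main step for a given PE cycle count."""
    return pe_cycles * CYCLE_TIME_S

def wall_time_for_simulated_day(pe_cycles, time_step_s):
    """Wall-clock seconds needed to advance the model by 24 hours."""
    steps_per_day = 86400.0 / time_step_s
    return steps_per_day * seconds_per_step(pe_cycles)

if __name__ == "__main__":
    print(1.0 / seconds_per_step(43000))               # iterations per second
    print(wall_time_for_simulated_day(43000, 200.0))   # seconds per simulated day
```

One simulated day needs 432 steps of 200 sec, each costing about 4.3 milliseconds, which gives the figure of roughly 2 seconds per simulated day.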


Name      Description                          MCU      PCU

(standard library floating-point routines)

fmul      multiplication                        63      787
fsub      subtraction                           61      381
fadd      addition                              61      381
fdiv      division                             101     1031
fhalf     division by two                      100      266
fmulsa    scalar-array multiplication           64      791

(user-written special purpose routines)

barl      longitude averages                   162      745
barp      latitude averages                    176     1041
ndifr     longitude finite differences         136     1269
ndifc     latitude finite differences          163     1668
filter    Shapiro 16th order filter          18357    96408

(user-written main step routines)

dtcalc    space derivatives                   4706    38835
update    leapfrog update                      390     3403


It is fair to say that there is little agreement on the
way the performance of unconventional machines could be
evaluated and compared. The MOPS/MFLOPS rates are
some of the most popular measures but even for those, their 
uniform validity across the machine spectrum is disputed. 
Other standards have also been proposed [22, 23, 24]. As 
long as we are interested in particular problems and not 
general evaluations a good strategy is to simply compare the 
timings for their solution on the examined machines [25]. 
Even this however may not be a fair criterion since the 


fastest algorithms for each machine would possess different 
numerical properties. Dimensions also would probably be 
chosen to suit the machine (magic numbers like 64 or 128 
for the Cray, the DAP or the MPP) rather than the 
modeller. Since our experiments have been conducted on a 
simulator it is only approximate comparisons that we could 
make. The DAP [26] array is 1/4 of the MPP array and also 
has considerably slower MFLOP rates. As a result, for this 
problem it would not compete with the MPP. On the other 
hand the currently used DAPs have 4K bits of memory per 
PE (to be extended to 16K) and hence it could be used to 
model a multi-level model, or one containing more equations, 
without the constraints imposed by frequent I/O exchanges 
which might be needed for the MPP in its initial memory 
configuration. Experiments on the CRAY-1 [19] for a similar
model with a resolution of $\Delta\lambda = \Delta\phi = 12^\circ$ required
$8.5 \times 10^{-4}$ sec/step, whereas with $\Delta\lambda = \Delta\phi = 4^\circ$ the
integration required $4.97 \times 10^{-3}$ sec/step. The $\Delta\lambda = 2\Delta\phi \approx 11^\circ$
resolution achieved with the MPP simulator took about
$4.3 \times 10^{-3}$ sec/step. As the parallelism is almost fully exploited
and redundancy is kept at a minimum, by going to the full
ARU this timing estimate would only increase slightly but
the available resolution would be $\Delta\lambda = 2\Delta\phi \approx 3^\circ$, which is
superior to the finer resolution in the Cray model above.
Therefore a better resolution and comparable or better 
timing would be achieved with the MPP. Moreover, it would 
be possible to have concurrent solutions, starting from 
different initial data as mentioned above. 


Conclusions 


We have discussed the use of the MPP for the
integration of a set of non-linear PDEs frequently occurring
in Numerical Weather Prediction. We have found that the
MPP can be used for the solution of such problems. We
have talked about the problems facing the algorithm
designer and suggested methods for coping with them. The
MPP gives much better performance for integer
computations. It is thus worth investigating the possibility
of using fixed or block floating point arithmetic in the
computations. The use of the MPP makes the simulation of
General Circulation models over the entire globe with
adequate horizontal resolution both possible and efficient.
Consequently the modeller doesn’t have to worry about 
imposing artificial lateral boundary conditions. A complete 
model would have multiple vertical levels. It would also take 
account of thermal phenomena which here were ignored, by 
incorporating more equations. A 4th order space differencing 
scheme would be preferable. For multiple level simulation 
the PE memory becomes inadequate. The addressing 
capability of the PE index registers is for 64K bits per PE 
memory and a future system could contain such memory. 
Nevertheless the first available MPP will have to use its 
Staging Memory as an immediate active buffer area. The SM 
capacity would initially be 4Mbytes and can be expanded to 
64Mbytes. In that configuration rates of 160 Mbytes/sec are 
achieved. The available permutation network has many 
capabilities for accessing subarrays or other patterns from 
and to the storage area. A few of the not too frequently used 
arrays for a large model could be stored in the stager and 
brought in the ARU under some paging policy designed to 
minimize interference with the computations. A preliminary 
theoretical study of the problems related with memory 
allocation and management for the Staging Memory can be 
found in [8]. Real or various analytical initial data would be 


gathered in a data base and from there it would initialize
the MPP arrays. Multiple concurrent simulations could then
be run as suggested above or a single fine-grid simulation
could be started and the results at selected time steps
would be displayed. For long predictions, periodic updating
of the variables could be done by utilizing new observed
data. Overall, the MPP would be a powerful computational
tool for modellers.


AN M-STEP PRECONDITIONED CONJUGATE GRADIENT METHOD FOR PARALLEL COMPUTATION


Loyce Adams 
Institute for Computer Applications in Science and Engineering 
Hampton, Virginia 23665 


Abstract -- This paper describes a
preconditioned conjugate gradient method that can
be effectively implemented on both vector machines
and parallel arrays to solve sparse symmetric and
positive definite systems of linear equations.
The implementation on the CYBER 203/205 and on the
Finite Element Machine is discussed and results
obtained using the method on these machines are
given.


Introduction 


In this paper we are concerned with the
solution of a sparse $N \times N$ system of symmetric
and positive definite linear equations

$$ Ku = f \qquad (1.1) $$

by preconditioned conjugate gradient (PCG) methods
on both vector computers and parallel arrays.
Several descriptions of these methods appear in
the literature; see for example, Concus, Golub,
O'Leary [1976] and Chandra [1978]. Also, Schrieber
[1978] discussed the implementation of conjugate
gradient (CG) on vector computers and Podsiadlo
and Jordan [1981] discussed its implementation on
the Finite Element Machine under construction at
NASA Langley Research Center.

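The PCG method solves the system $\hat{K}\hat{u} = \hat{f}$
where

$$ \hat{K} = Q^{-1} K Q^{-T}, \quad \hat{u} = Q^{T} u, \quad \hat{f} = Q^{-1} f, \qquad (1.2) $$

$Q$ is a nonsingular matrix, and the symmetric and
positive definite preconditioning matrix is given
by $M = QQ^{T}$. The algorithm for the solution of
$u$ directly is described in Chandra [1978] and is
given below where $u$, $r$, $\tilde{r}$, and $p$ are vectors
and $(x, y)$ denotes the inner product $x^{T}y$.

Choose $u^0$; compute $r^0 = f - Ku^0$; solve $M\tilde{r}^0 = r^0$; set $p^0 = \tilde{r}^0$.

For $k = 0, 1, \ldots, k_{max}$

(1) $\alpha = (\tilde{r}^k, r^k) / (p^k, Kp^k)$
(2) $u^{k+1} = u^k + \alpha p^k$
(3) If the convergence test is satisfied then
    stop, otherwise continue.
(4) $r^{k+1} = r^k - \alpha Kp^k$
(5) $M\tilde{r}^{k+1} = r^{k+1}$
(6) $\beta = (\tilde{r}^{k+1}, r^{k+1}) / (\tilde{r}^k, r^k)$
(7) $p^{k+1} = \tilde{r}^{k+1} + \beta p^k$

Algorithm 1. PCG Algorithm

We note that the standard conjugate gradient
algorithm results by choosing $M = I$.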
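
To make the structure of the iteration concrete, here is a minimal Python sketch of a preconditioned conjugate gradient loop for a small dense matrix stored as a list of rows. It follows the standard textbook form of the method; the stopping rule (largest change in u below a tolerance), the zero starting vector, and all function names are assumptions made for the example rather than details taken from the paper.

```python
def dot(x, y):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, x) for row in A]

def pcg(K, f, solve_M, tol=1e-10, max_iter=100):
    """Solve K u = f for symmetric positive definite K.

    solve_M(r) must return the solution of M r_tilde = r for the chosen
    preconditioner M; passing a copy of r unchanged gives plain CG.
    """
    u = [0.0] * len(f)                  # starting guess (assumed zero)
    r = list(f)                         # residual f - K u for u = 0
    r_tilde = solve_M(r)
    p = list(r_tilde)
    rho = dot(r_tilde, r)
    for _ in range(max_iter):
        if rho == 0.0:                  # residual is exactly zero
            break
        Kp = matvec(K, p)
        alpha = rho / dot(p, Kp)        # first inner-product step
        step = [alpha * pi for pi in p]
        u = [ui + si for ui, si in zip(u, step)]
        if max(abs(si) for si in step) < tol:   # assumed convergence test
            break
        r = [ri - alpha * kpi for ri, kpi in zip(r, Kp)]
        r_tilde = solve_M(r)            # preconditioning solve
        rho_new = dot(r_tilde, r)
        beta = rho_new / rho            # second inner-product step
        p = [rt + beta * pi for rt, pi in zip(r_tilde, p)]
        rho = rho_new
    return u

if __name__ == "__main__":
    K = [[4.0, 1.0], [1.0, 3.0]]
    f = [1.0, 2.0]
    jacobi = lambda r: [r[0] / 4.0, r[1] / 3.0]   # M = diag(K)
    print(pcg(K, f, jacobi))
```

The two calls to dot inside the loop are the inner products whose cost motivates the rest of the paper; every other line is an elementwise vector operation or a matrix-vector product.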
For vector machines, if $M = I$, all steps of
the iteration loop except (1) and (6) can be
vectorized. In particular, the multiplication
$Kp$, for $K$ sparse, vectorizes after a suitable
ordering of the equations and will be discussed in
detail in Section 3. The difficulty arises in the
formation of the inner products necessary to
calculate $\alpha$ and $\beta$. These calculations require
a phase in which $N$ partial sums must be added
together and therefore do not vectorize well.

For parallel arrays like the Finite Element
Machine (Jordan [1978], Adams [1982]), the
calculation of $u$, $r$, and $p$ can be distributed
to the individual processors and the necessary
communication between processors can be performed
on the dedicated local links. The convergence
test in (3) can be performed by using the flag
network. However, for a large number of
processors, the calculations of $\alpha$ and $\beta$ can be
expensive since the number of values to be summed
for each inner product is equal to $P$, the number
of processors. Jordan [1979] realized that this
was potentially detrimental to the efficiency of
the method on this machine, and as a result, a
special hardware circuit (sum/max) was designed to
perform the $P$ sums in $O(\log_2 P)$ time.
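
The logarithmic step count comes from combining partial sums pairwise. The Python sketch below imitates that combining pattern in software and counts the number of combining levels; it is an illustration of the idea only and says nothing about how the sum/max circuit itself is built. The function name is hypothetical.

```python
def tree_sum(values):
    """Sum values by pairwise combination.

    Returns (total, levels), where levels is the number of combining rounds
    that would be needed if all pairs in a round were added simultaneously.
    """
    level = list(values)
    if not level:
        return 0, 0
    rounds = 0
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])    # odd element passes through unchanged
        level = paired
        rounds += 1
    return level[0], rounds

if __name__ == "__main__":
    print(tree_sum(list(range(1, 9))))   # 8 partial sums
    print(tree_sum([1] * 1024))          # 1024 partial sums
```

For P partial sums the number of rounds is the base-2 logarithm of P rounded up, compared with P - 1 additions when the values are accumulated one after another.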
Since Algorithm 1 has two inner products per
iteration that will become costly as $N$ (on
vector machines) or $P$ (on arrays) increases, a
natural goal is to devise a preconditioner that
will reduce the number of CG iterations, and hence
the number of inner products, while being
inexpensive to implement. In the next section
preconditioners that are based on taking $m$ steps
of an iterative method are described. In Section
3, the implementations of these methods on the
CYBER 203/205 and the Finite Element Machine are
given for a system of equations that results from
an example structural engineering problem.
Results for this problem on the CYBER 203 and the
Finite Element Machine are given in Section 4.


2. M-Step Parallel Preconditioners 
2.1 Choosing M 


The preconditioned conjugate gradient
algorithm of the last section requires a symmetric
and positive definite preconditioning matrix $M$.
The question is how to choose $M$ so that the
condition number of $\hat{K}$,

$$ \kappa(\hat{K}) = \frac{\lambda_{max}}{\lambda_{min}}, $$

is as small as possible.

The best choice for $M$ in the sense of
minimizing $\kappa(\hat{K})$ is $M = K$ but this gains
nothing since $K\tilde{r} = r$ is just as difficult to
solve as $Ku = f$. A class of preconditioners that
appears to be easily implemented on parallel
computers arises by choosing $M$ to be a splitting
of $K$ that describes a linear stationary
iterative method. As an example, the SSOR
splitting of $K$ yields

$$ M = \frac{\omega}{2-\omega}\left(\frac{1}{\omega}D - L\right) D^{-1} \left(\frac{1}{\omega}D - U\right) \qquad (2.1) $$

where $D$, $-L$, and $-U$ are the diagonal, strictly
lower, and strictly upper parts of $K$ respectively
(and $\omega$ is the relaxation parameter). This splitting
has been considered extensively in the literature as a
preconditioner; for example, refer to Concus, Golub,
O'Leary [1976] and the references therein. Now, if the
matrix $K$ is ordered by the Multicolor ordering (Adams
and Ortega [1982]), the system $M\tilde{r} = r$ can be
implemented on parallel computers as a forward
followed by a backward Multicolor SOR iteration
applied to $K\tilde{r} = r$ with initial guess $\tilde{r}^{(0)} = 0$ and
will be explained in more detail in Section 3.
The question now arises whether it would be 
beneficial to take more than one step of a linear 
stationary iterative method to produce a 
preconditioner M that more closely approximates 
K. If this is done, the resulting preconditioning 
matrix is 
m-1)-1 
M = P(I+G+...4G” ) >. (2.2) 
Now, M must be symmetric and positive definite to
be considered as a preconditioner. The necessary
and sufficient conditions for M to satisfy these
requirements are given in Adams [1982] and we only
note here that if P is the SSOR splitting matrix
these conditions are met. We also note that
Dubois, Greenbaum, and Rodrigue [1979] considered
a truncated Neumann series for $K^{-1}$ as a
preconditioner which corresponded to a Jacobi
splitting where P = diag(K).
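
The link between m steps of a stationary iteration and a truncated Neumann series can be checked numerically. The following is a minimal sketch for the Jacobi splitting only, with P = diag(K) and G = I - P^{-1}K; the function names and the small test matrix are illustrative assumptions and do not come from the paper.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def jacobi_m_steps(K, r, m):
    """Take m Jacobi steps on K z = r, starting from z = 0."""
    n = len(r)
    z = [0.0] * n
    for _ in range(m):
        Kz = matvec(K, z)
        # z <- z + P^{-1} (r - K z) = G z + P^{-1} r, with P = diag(K)
        z = [z[i] + (r[i] - Kz[i]) / K[i][i] for i in range(n)]
    return z


def neumann_m_terms(K, r, m):
    """Apply (I + G + ... + G^(m-1)) P^{-1} to r, with G = I - P^{-1} K."""
    n = len(r)
    term = [r[i] / K[i][i] for i in range(n)]  # P^{-1} r
    total = list(term)
    for _ in range(m - 1):
        Kt = matvec(K, term)
        term = [term[i] - Kt[i] / K[i][i] for i in range(n)]  # G * term
        total = [t + s for t, s in zip(total, term)]
    return total


# Hypothetical 3x3 symmetric positive definite test matrix.
K = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
r = [1.0, 2.0, 3.0]
z_iter = jacobi_m_steps(K, r, 3)
z_series = neumann_m_terms(K, r, 3)
```

Under these assumptions the two vectors agree to rounding error, which is the sense in which m Jacobi steps with a zero initial guess apply the inverse in (2.2).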


Even though the preconditioner in (2.2) for
the SSOR splitting is symmetric and positive
definite, the question of how well the resulting
PCG method will reduce the number of CG iterations
must be answered. In Adams [1982], for the SSOR
splitting, the condition number of the matrix
$\hat{K}$ of (1.2) was proven to decrease as the
number of steps of the preconditioner in (2.2)
increases; however, the maximum ratio
$\kappa(\hat{K}_1)/\kappa(\hat{K}_m)$ was shown to
be m. In practice, for larger m, this reduction
may not be enough to balance the increase in the
work that must be done by the preconditioner (as
results in Section 4 verify). However, by
parametrizing this preconditioner, the method is
very effective. This parametrization is briefly
discussed in the next section and the parameters
for the SSOR splitting are given.


2.2 Parametrizing M 

Johnson, Micchelli, and Paul [1982] have
suggested symmetrically scaling the matrix K to
have unit diagonal and then taking m terms of a
parametrized Neumann series for
$K^{-1} = (I - G)^{-1}$ as the value for $M^{-1}$.
This corresponds to a symmetric preconditioning
matrix whose inverse is a polynomial of degree
m-1 in G,

$$M_m^{-1} = \alpha_0 I + \alpha_1 G + \alpha_2 G^2 + \cdots + \alpha_{m-1} G^{m-1} \qquad (2.3)$$

derived from the Jacobi splitting, K = I - G,
of K; hence, the solution to $M\hat{r} = r$ can be
implemented by taking m steps of the Jacobi
iterative method applied to $K\hat{r} = r$ with
initial guess $\hat{r}^{(0)} = 0$. Johnson et al.
choose the $\alpha_i$'s so that the eigenvalues of
$M_m^{-1}K$, and hence those of $M_m$, are positive
on the interval $[\lambda_1, \lambda_n]$ that
contains the eigenvalues of K and are as close to
1 as possible in some sense such as the min-max or
the least squares criteria. Clearly, if m = 1,
$M_1^{-1}K = \alpha_0 K$ and the condition number of
$M_1^{-1}K$ is the same for all $\alpha_0 \neq 0$.
Hence, we are only interested in m > 1.

We now generalize this idea for any splitting
of the matrix K,

$$K = P - Q. \qquad (2.5)$$

If $G = P^{-1}Q$, then by parametrizing (2.2), the
inverse of the m-step preconditioner becomes

$$M_m^{-1} = (\alpha_0 I + \alpha_1 G + \alpha_2 G^2 + \cdots + \alpha_{m-1} G^{m-1})\,P^{-1} \qquad (2.6)$$

and will be symmetric if P is symmetric. We
choose the values of $\alpha_i$ so that the
eigenvalues of $M_m^{-1}K$ are positive on the
interval $[\lambda_1, \lambda_n]$ that contains the
eigenvalues of $P^{-1}K$ and are as close to 1 as
possible in some sense such as the min-max or
least squares criteria. For the least squares
criteria, the values of $\alpha_i$ that correspond
to the SSOR splitting are given in Table 1 for
m = 2, 3, and 4.


Table 1.
α Values for the m-step SSOR PCG Method

m     α_0      α_1       α_2       α_3
2     1.00     5.00
3     1.00    -2.00      7.00
4     1.00     7.00    -24.50     31.50
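
One way to see why applying the inverse in (2.6) costs only m steps of the underlying iteration is to write the polynomial in Horner form. The following is a sketch in the notation of (2.5) and (2.6); the intermediate iterates $z^{(s)}$ are introduced here only for illustration and are not symbols from the paper.

```latex
\begin{aligned}
z^{(0)} &= 0, \\
z^{(s)} &= G\,z^{(s-1)} + \alpha_{m-s}\,P^{-1} r, \qquad s = 1, \dots, m, \\
z^{(m)} &= \sum_{s=1}^{m} \alpha_{m-s}\, G^{\,m-s} P^{-1} r
         = \left(\alpha_0 I + \alpha_1 G + \cdots + \alpha_{m-1} G^{m-1}\right) P^{-1} r
         = M_m^{-1} r .
\end{aligned}
```

Each line of the recursion has the form $z \leftarrow Gz + P^{-1}b$ with $b = \alpha_{m-s} r$, that is, one step of the stationary iteration for the splitting K = P - Q applied to $Kz = \alpha_{m-s} r$.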


In the next section we describe how to implement
the m-step parametrized SSOR PCG method on the
CYBER 203/205 and on the Finite Element Machine
and in Section 4, results on these machines are
given.

3. Implementation of the m-step SSOR PCG Method

We first describe the algorithm for solving
$M\hat{r} = r$, where M is the preconditioning matrix
given by (2.6). To be concrete, this description
will be given for the following test problem.

The domain considered will be a rectangular 
plate discretized with triangular finite elements 
over which linear basis functions are defined. The 
nodes of the triangles are colored Red, Black, and 
Green so that nodes on a given triangle are 
different colors as shown in Figure 1. This 
coloring, as described in Adams and Ortega [1982], 
decouples the equations so that an implementation 
on either vector or array computers is possible as 
will become more apparent later in this 
discussion. 


Figure 1. Plate (Triangular Elements)


The problem is to determine the displacements,
say u and v, in the x and y directions
respectively at each node in the plate whenever
the plate is loaded on one edge and constrained on
another. The partial differential equations of
plane stress that govern these displacements are
well known, see Norrie and DeVries [1978], but do
not contribute to the discussion here. The
important point to make is that the stiffness
matrix K of (1.1) will be symmetric and positive
definite and will have dimension 2ab x 2ab,
where a is the number of rows of nodes and b
is the number of columns of unconstrained nodes (2
unknowns at each node), and each row of K will
contain at most 14 nonzero elements which
correspond to the grid point stencil for linear
triangular elements shown in Figure 2.


Figure 2. Grid Point Stencil


Observe from Figures 1 and 2 that while there
is no coupling between the equations at two nodes
of the same color, the equations at a given node
do couple. Hence, to completely decouple the
system, six colors are necessary; namely, Red(u),
Red(v), Black(u), Black(v), Green(u), and
Green(v). Now, if the equations at the nodes in
Figure 1 are numbered by these six colors from
bottom to top, left to right, the system
$K\hat{r} = r$ has the form,

$$
\begin{bmatrix}
D_{11} & B_{12} & B_{13} & B_{14} & B_{15} & B_{16} \\
B_{12}^T & D_{22} & B_{23} & B_{24} & B_{25} & B_{26} \\
B_{13}^T & B_{23}^T & D_{33} & B_{34} & B_{35} & B_{36} \\
B_{14}^T & B_{24}^T & B_{34}^T & D_{44} & B_{45} & B_{46} \\
B_{15}^T & B_{25}^T & B_{35}^T & B_{45}^T & D_{55} & B_{56} \\
B_{16}^T & B_{26}^T & B_{36}^T & B_{46}^T & B_{56}^T & D_{66}
\end{bmatrix}
\begin{bmatrix} \hat{r}_1 \\ \hat{r}_2 \\ \hat{r}_3 \\ \hat{r}_4 \\ \hat{r}_5 \\ \hat{r}_6 \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \\ r_5 \\ r_6 \end{bmatrix}
\qquad (3.1)
$$

where $B_{12}$, $B_{34}$, $B_{56}$, and $D_{ii}$,
i = 1 to 6, are diagonal matrices.

The SSOR iteration can be realized by a
forward followed by a backward Multicolor SOR
iteration (Adams and Ortega [1982]), but is only
as expensive as one Multicolor SOR iteration since
a technique of Conrad and Wallach [1979] can be
used to save results in an auxiliary vector, y,
from the forward pass to be used in the backward
pass. The procedure is given below for solving
$M\hat{r} = r$ of Algorithm 1. The relaxation
parameter $\omega$ of the SSOR method causes no
problems in the implementation and will be set to
one here for simplicity.


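(1) $\hat{r} = 0$; $y = 0$
(2) For s = 1 to m
    (1) For c = 1 to 6
        (1) Form $x = -\sum_{j=1}^{c-1} B_{jc}^T \hat{r}_j$
        (2) Solve $D_{cc}\hat{r}_c = x + y_c + \alpha_{m-s} r_c$
        (3) Set $y_c = x$
    (2) For c = 5 down to 2
        (1) Form $x = -\sum_{j=c+1}^{6} B_{cj} \hat{r}_j$
        (2) Solve $D_{cc}\hat{r}_c = x + y_c + \alpha_{m-s} r_c$
        (3) Set $y_c = x$
(3) Solve $D_{11}\hat{r}_1 = -\sum_{j=2}^{6} B_{1j}\hat{r}_j + y_1 + \alpha_0 r_1$


Algorithm 2. m-step 6-color SSOR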

Notice that the values of $\alpha_i$ above are the
parameters that were given in Table 1, and if no
parametrization is desired, these are simply set
to one. We also point out that Algorithm 2 can
easily be modified to solve problems whose domains
are discretized by more complicated finite
elements or finite differences as long as a
multicolor ordering is used. For more details see
Adams and Ortega [1982]. We now turn to the
implementation of Algorithm 1 in conjunction with
Algorithm 2 on the CYBER 203/205.
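
The effect of the m-step parametrized SSOR preconditioner can be illustrated on a small dense matrix. The sketch below is not the multicolor Algorithm 2: it assumes $\omega = 1$, uses plain forward and backward Gauss-Seidel sweeps in the natural ordering, and omits the auxiliary-vector saving of Conrad and Wallach. The function names and the test matrix are hypothetical.

```python
def ssor_sweep(K, z, b):
    """One SSOR sweep with omega = 1 on K z = b, updating z in place.

    Forward over all unknowns, then backward from the second-to-last one;
    repeating the last unknown would reproduce the same value when omega = 1.
    """
    n = len(b)
    order = list(range(n)) + list(range(n - 2, -1, -1))
    for i in order:
        s = sum(K[i][j] * z[j] for j in range(n) if j != i)
        z[i] = (b[i] - s) / K[i][i]
    return z


def m_step_ssor(K, r, alphas):
    """Apply the m-step preconditioner; alphas = [alpha_0, ..., alpha_(m-1)]."""
    m = len(alphas)
    z = [0.0] * len(r)
    for s in range(1, m + 1):
        # Step s is one SSOR sweep on K z = alpha_(m-s) * r (Horner form).
        b = [alphas[m - s] * ri for ri in r]
        ssor_sweep(K, z, b)
    return z


# Hypothetical symmetric positive definite test matrix.
K = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
unit = [[1.0 if i == j else 0.0 for i in range(3)] for j in range(3)]
# Columns of the preconditioner inverse for m = 2 with the Table 1 values.
cols = [m_step_ssor(K, e, [1.0, 5.0]) for e in unit]
```

With all parameters set to one and m large, the same routine approaches the solution of K z = r, which matches the remark that the unparametrized preconditioner more closely approximates K as m grows.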


3.1 CYBER 203/205 Implementation 
On the CYBER 203/205, vectors consist of 


contiguous storage locations and maximum 
efficiency of vector operations is achieved for 
very long vectors. For vectors of length 1000 
around 90% efficiency is obtained, but this drops 
to approximately 50% or less for vectors of length 
100 and 10% for vectors of length 10. 
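
The quoted efficiencies are close to what a simple half-performance-length model predicts. The sketch below assumes efficiency n / (n + n_half) with n_half = 100; both the model and that value are assumptions chosen to match the three figures quoted above, not numbers taken from the paper.

```python
def vector_efficiency(n, n_half=100.0):
    """Fraction of peak vector rate for length n under the assumed model."""
    return n / (n + n_half)


for n in (10, 100, 1000):
    # Prints roughly 0.09, 0.5 and 0.91 for the three vector lengths.
    print(n, round(vector_efficiency(n), 2))
```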

To achieve the maximum vector length for our
test problem the u equations at the Red nodes
(left to right, bottom to top) including the
constrained nodes are numbered first, followed by
the corresponding v equations at the Red nodes,
then by the Black u, Black v, Green u, and Green
v equations. The numbering of the constrained
equations is necessary for ease of implementation
given the CYBER's contiguous storage requirement
but also increases the vector length from
$\frac{1}{3}ab$ to $\frac{1}{3}a(b+1)$. Of course,
the actual updating of the storage locations
corresponding to these constrained nodes is
prohibited by the control vector feature on this
machine, see Ortega and Voigt [1977], and for
large values of a and b little inefficiency is
incurred. For a unit square plate, the maximum
vector length for our test problem is $a^2/3$ and
is around 1000 when $a \approx 55$, or
equivalently when the width of each triangle is
equal to 1/54.

The contiguous storage requirement coupled
with the manner in which the nodes are colored
imposes a restriction on the number of nodes that
can be in each row of the plate. In particular,
the last node in the first row must be Black so
that the coloring R/B/G/R/B/G, etc. wraps around
from one row to the next.

Now, the calculations of $Ku^0$ and $Kp^k$ in
Algorithm 1 can be done by a straightforward
generalization of Madsen, Rodrigue, and Karush's
[1976] matrix multiplication by diagonals scheme
since K of (3.1) has the structure shown in
(3.2) (and will be stored by these diagonals as
well):

[Diagonal structure of K, with block rows and
columns ordered R(u), R(v), B(u), B(v), G(u),
G(v)]                                        (3.2)

Also, the multiplication of $B_{jc}^T\hat{r}_j$ and
$B_{cj}\hat{r}_j$ in Algorithm 2 can be performed by
the same technique. The subtraction in the
convergence test
$\|u^{k+1} - u^k\|_\infty < \epsilon$ vectorizes and
the absolute value is performed by the vector
absolute value function that is available on the
CYBER. The inner products for the calculation of
$\alpha$ and $\beta$ are done by a call to an inner
product routine which utilizes the machine's
vector hardware; however, the additions of the
partial sums make this operation considerably
slower than the other vector operations required
in the algorithm.


Figure 3a. 18 nodes/processor

Next, we turn to the implementation of
Algorithm 1 in conjunction with Algorithm 2 on the
Finite Element Machine.


3.2 Finite Element Machine Implementation

The first task for the implementation on this
machine is to assign the nodes (and hence
equations at the nodes) of the plate to the
processors. This is done by assigning each
processor, as nearly as possible, an equal number
of Red, Black, and Green unconstrained nodes as
illustrated in Figures 3a, 3b, and 3c, where in
each Figure, the node colorings may repeat beyond
the region shown.


Figure 3b. 9 nodes/processor


Figure 3c.

In contrast to the CYBER implementation we need
not be concerned with numbering the constrained
nodes, but instead we should require that each
processor receive an equal distribution of each
color of the unconstrained nodes.

Since memory is distributed on the Finite
Element Machine, each processor stores the portion
of u, p, r, $\hat{r}$, and K that corresponds to its
collection of nodes. For each equation that is
assigned to a processor, 14 storage locations are
reserved for the nonzero coefficients of K that
correspond to the grid point stencil in Figure
2. For more information about these data
structures see Adams [1982]. In addition, storage
must be reserved in each processor for the portion
of p that must be received from neighbor
processors during the calculation of Kp each
iteration. For example, in Figure 3b, processor 1
must reserve storage for the components of p
that correspond to the 3 border nodes in processor
3 and the 3 border nodes in processor 2, but no
components are received from processor 4 since no
nodes in processors 1 and 4 share a common
triangle. This same storage may be used initially
for $u^0$ during the calculation of $Ku^0$.
Similarly, storage must be reserved for the
$\hat{r}$ components associated with the equations
at border nodes in neighbor processors for the
multiplications of $B_{jc}^T\hat{r}_j$ and
$B_{cj}\hat{r}_j$ in Algorithm 2.
The sending and receiving of the border p
components in each CG iteration in Algorithm 1 and
the border $\hat{r}$ components during each step of
the preconditioner in Algorithm 2 is only (for
rectangular regions) between neighbor processors
and in particular for our test problem will
require six of the machine's eight nearest
neighbor links as shown in Figure 4 for processor
P.


Figure 4. FEM Local Links


Hence, the communication required for the m-step
SSOR preconditioner on this machine is completely
local and the amount of data that a given
processor must communicate can be seen from Figure
3 to be dependent on its number of neighbors as
well as the dimension of the rectangle of nodes
assigned to it. To reduce the time required for
the I/O, the values of each color to be sent to a
given neighbor can be packaged and sent as one
record and likewise for the values of a particular
color to be received from a given neighbor. If
this is done, it becomes advantageous to think of
the two equations at the same node as being the
same color, because, on this machine, it does not
matter that they couple since they will always be
assigned to the same processor.

The convergence test in Algorithm 1 is
implemented by the signal flag network. Each
processor raises its convergence flag whenever its
portion of u values are within the stopping
criterion. The processors are then synchronized
and tested to see if all flags are raised; if so,
the iteration stops; if not, all flags are
lowered and the iteration continues.

Lastly, we summarize our remarks about the
Finite Element Machine implementation of Algorithm
2 by providing a parallel version in Algorithm 3
that will be executed by processor p. The
subscript p denotes the portion of a vector that
is assigned to processor p, the subscript n
denotes the portion of the vector that is received
from all of processor p's neighbors and the
subscript t denotes the total vector which
consists of the components received by, as well as
those assigned to, processor p.


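(1) $\hat{r}_p = 0$; $y_p = 0$
(2) For s = 1 to m
    (1) For c = 1 to 6
        (1) $x = -\sum_{j=1}^{c-1} B_{jc}^T \hat{r}_{j,t}$
        (2) $D_{cc,p}\hat{r}_{c,p} = x + y_{c,p} + \alpha_{m-s} r_{c,p}$
        (3) $y_{c,p} = x$
        (4) If c mod 2 = 0 then
            (1) Send border portion of
                $\hat{r}_{c-1,p}$ and $\hat{r}_{c,p}$
            (2) Receive $\hat{r}_{c-1,n}$ and
                $\hat{r}_{c,n}$
    (2) For c = 5 down to 2
        (1) $x = -\sum_{j=c+1}^{6} B_{cj} \hat{r}_{j,t}$
        (2) $D_{cc,p}\hat{r}_{c,p} = x + y_{c,p} + \alpha_{m-s} r_{c,p}$
        (3) $y_{c,p} = x$
        (4) If c mod 2 ≠ 0 then
            (1) Send border portion of
                $\hat{r}_{c+1,p}$ and $\hat{r}_{c,p}$
            (2) Receive $\hat{r}_{c+1,n}$ and
                $\hat{r}_{c,n}$
(3) Solve $D_{11,p}\hat{r}_{1,p} = -\sum_{j=2}^{6} B_{1j}\hat{r}_{j,t} + y_{1,p} + \alpha_0 r_{1,p}$


Algorithm 3. FEM m-step 6-color SSOR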


4. Results


The example plane stress problem was run on
the CYBER 203 at the NASA Langley Research Center
for a unit square plate for varying mesh sizes.
Table 2 gives the number of iterations, I, and
time, T, in seconds to solve this problem using
m = 0-10. The parametrized preconditioner results
are denoted by P, the number of rows in the plate
by a, and the maximum vector length by v.

Table 2. CYBER 203 Iterations and Timings m-step SSOR PCG

        v = 22      v = 41      v = 132     v = 561      v = 1282      v = 2134
        a = 8       a = 11      a = 20      a = 41       a = 62        a = 80
  m     I     T     I     T     I     T     I      T     I       T     I       T
  0   112  .133   157  .215   271  .565   536  3.293   788  11.845   929  22.780
  1    52  .129    66  .184   111  .454   214  2.373   311   7.832   395  17.194
  2    38  .143    50  .208    79  .478   152  2.428   221   7.773   280  17.380
  2P   31  .116    40  .167    61  .369   118  1.885   172   6.052   218  13.534
  3    31  .155    39  .216    65  .520   124  2.585   181   8.174   229  18.469
  3P   24  .127    30  .167    46  .369    88  1.836   129   5.828   163  13.151
  4P   22  .138    24  .166    35  .350    67  1.726    99   5.471   124  12.306
  5P   19  .143    20  .167    29  .347    56  1.716    82   5.345   104  12.260
  6P   18  .159    18  .175    25  .348    47  1.670    70   5.263    88  12.011
  7P                           26  .413    43  1.739    64   5.451    80  12.410
  8P                           21  .375    36  1.634    54   5.139    69  11.985
  9P                                       33  1.660    48   5.056    61  11.731
 10P                                       31  1.709    44   5.070    55  11.594

It should be noted that the inner product routine
that was used for these results was developed at
Langley and is optimized for the CYBER 203.
Several observations can be made from these six
test cases.

(1) The parametrized preconditioner is
    better with respect to both the number
    of iterations and the execution time
    than the corresponding unparametrized
    preconditioner.
(2) The optimal number of steps of the
    parametrized preconditioner increased
    as the vector length increased.


In relation to (2), an interesting question
is to determine how many steps would be beneficial
for a large problem. The answer to this is quite
simple if the number of iterations, $N_m$, could be
expressed as a function of m, since the execution
time of the m-step method can be expressed as

$$T(m) = N_m (A + mB) \qquad (4.1)$$

where A is the time for one outer conjugate
gradient iteration and B is the time for 1 step
of the preconditioner. Now if we assume that
$N_{m+1} < N_m$, taking m+1 steps is more beneficial
than taking m steps whenever

(1) $(m+1)N_{m+1} - mN_m < 0$. (This means that
    the total number of inner loops is less for m+1
    steps)
or (2) $\dfrac{B}{A} < \dfrac{N_m - N_{m+1}}{(m+1)N_{m+1} - mN_m}$.    (4.2)

The inequalities in (4.2) explain for larger
problems when more steps of the preconditioner
should be taken. For instance, the values of the
left and right side of inequality (2) when m = 9
are (.81, .15), (.68, .5), and (.76, 6) for a =
41, 62, and 80 respectively. Hence, ten steps
are preferable to nine only for a = 80.
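
The comparison between nine and ten steps can be reproduced from the timing model T(m) = N_m(A + mB). The sketch below computes the largest B/A for which m+1 steps still beat m steps; the function name is hypothetical, the iteration counts are the 9P and 10P entries of Table 2, and the B/A values are the ones quoted in the text.

```python
def ratio_threshold(m, n_m, n_m1):
    """Largest B/A for which m+1 steps beat m steps under T(m) = N_m (A + m B).

    n_m and n_m1 are the iteration counts with m and m+1 steps, n_m1 < n_m.
    """
    denom = (m + 1) * n_m1 - m * n_m
    if denom <= 0:
        # Fewer total inner steps: m+1 steps win for every B/A.
        return float("inf")
    return (n_m - n_m1) / denom


# a: (N_9, N_10, quoted B/A)
cases = {41: (33, 31, 0.81), 62: (48, 44, 0.68), 80: (61, 55, 0.76)}
better_with_ten = [a for a, (n9, n10, b_over_a) in cases.items()
                   if b_over_a < ratio_threshold(9, n9, n10)]
```

Under these inputs the thresholds come out as roughly 0.15, 0.5 and 6, so only the a = 80 case satisfies the inequality.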

We now give the Finite Element Machine
results. The example plane stress problem with 6
rows and 6 columns of nodes (60 equations) was
solved on a 1, 2 and then on a 5-processor Finite
Element Machine using the m-step SSOR PCG
method. For this problem the assignment of
unconstrained nodes to the processors is shown in
Figure 5.


Figure 5. FEM Processor Assignments (panels: Two Processors, Five Processors)

Observe from Figure 5 that for the two and five
processor assignments each processor has an equal
number of R, B, and G nodes as well as an
equal number of border nodes to be communicated.
Therefore, in the absence of communication time
and any differences in processor speeds, a speedup
of two (five) over the one processor case should
be realized.


The number of iterations and the time in
seconds for the above assignments are given in
Table 3. The speedups for the two and five
processor assignments also are included.


Table 3. FEM Iterations, Timings, Speedups m-step SSOR PCG

        p = 1          p = 2                   p = 5
  m     I      T       I      T   Speedup      I      T   Speedup
  0    48   63.35     48   33.01   1.92       48   17.70   3.58
  1    19   47.90     19   25.85   1.85       19   14.85   3.23
  2    13   48.75     13   26.65   1.83       13   15.50   3.15
  2P   11   41.95     11   22.95   1.83       11   13.30   3.15
  3    11   54.95     11   30.15   1.82       11   17.65   3.11
  3P    8   41.25      8   22.75   1.81        8   13.25   3.11
  4    10   62.40     10   34.30   1.82       10   20.20   3.09
  4P    6   39.80      6   22.00   1.81        6   12.90   3.09
  5P    5   40.60      5   22.50   1.80        5   13.25   3.06
  6P    5   47.05      5   26.20   1.80
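
The speedup column is the one-processor time divided by the p-processor time. As a small worked check, the sketch below recomputes two entries for m = 0 from the timings in Table 3 and adds a parallel efficiency (speedup divided by p); the efficiency figure is an illustration and is not reported in the paper.

```python
def speedup(t_one, t_p):
    """Speedup of a p-processor run over the one-processor run."""
    return t_one / t_p


def efficiency(t_one, t_p, p):
    """Speedup per processor (an added illustration, not a paper metric)."""
    return speedup(t_one, t_p) / p


# m = 0 timings from Table 3: p = 1, p = 2, p = 5.
s2 = round(speedup(63.35, 33.01), 2)
s5 = round(speedup(63.35, 17.70), 2)
e5 = round(efficiency(63.35, 17.70, 5), 2)
```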


Several observations can be made from Table 3.


(1) The effectiveness of the preconditioner
    as a function of m was the same for the
    sequential and two and five processor
    cases (4P, 5P, 3P, 2P, 1, 2, 3, 4).

(2) Taking more than one step of the
    unparametrized preconditioner was not
    advantageous.

(3) The overhead for the CG (m = 0) algorithm
    was less than that for the PCG Algorithm
    because for two and five processors the
    communications for the preconditioner
    rather than for the inner products
    dominate the overhead.


In regard to (3), if we keep the number of nodes
per processor fixed and continue to add processors
up to a certain number, say $n_1$, the overhead for
the preconditioner will still be more than that
for the CG method and hence m = 3P or 2P may
become optimal; however, as the number of
processors increases beyond $n_1$, the value of
B/A in (4.2) will continue to decrease until
m ≥ 4P steps of the preconditioner will be
optimal. The behavior of the m-step PCG Algorithm
can be modelled as a function of the number of
processors, the problem size, and the relative
speed of arithmetic to communication times for the
machine. For more details, see Adams [1982].
5. Summary and Conclusions

The m-step multicolor SSOR preconditioned
conjugate gradient method described herein has
been shown to be effective on vector computers and
for a small problem was effective on the Finite
Element Machine. As more processors and the
sum/max hardware circuit become available on this
machine, the method will be tested on larger
problems. This method does not face the usual
difficulty in choosing the optimal relaxation
parameter, $\omega$, for the multicolor SSOR method,
since for this ordering and few colors $\omega = 1$
is a good choice, see Adams [1983]. A problem still
remains in applying the method to irregular
regions since the grid must be colored and for
array machines must also be distributed to the
processors in light of this coloring.


MINIMIZING INNER PRODUCT DATA DEPENDENCIES
IN CONJUGATE GRADIENT ITERATION


John Van Rosendale
Institute for Computer Applications in Science and Engineering
Hampton, VA 23665


Abstract 


The amount of concurrency available in
conjugate gradient iteration is limited by the
summations required in the inner product
computations. The inner product of two vectors of
length N requires time c*log(N), if N or more
processors are available.


This paper describes an algebraic
restructuring of the conjugate gradient algorithm,
which minimizes data dependencies due to inner
product calculations. After an initial start up,
the new algorithm can perform a conjugate gradient
iteration in time c*log(log(N)).


Introduction 


Conjugate gradient iteration is a method of
linear equation solution of great practical
importance. It can be used to solve any linear
system

Au = b

where A is symmetric, positive definite, and can
be quite efficient when coupled with various
preconditioning techniques. However, CG
(conjugate gradient) iteration involves the
computation of inner products at every
iteration. On parallel computers with large
numbers of processors, the data dependencies
inherent in these inner product calculations will
limit the speed of conjugate gradient iteration
for large sparse linear systems. See, for
example, Schreiber [1981] and Adams [1982]. In
fact, given sufficiently many processors, the
summation fan-ins in the inner product
calculations will dominate the computation time on
nearly all large sparse linear systems occurring
in practice.


Conjugate Gradient Iteration 


This paper presents a solution to this
problem through an algebraic restructuring of the
CG Algorithm. Consider first the standard CG
iteration. One of a number of mathematically
equivalent forms of it may be given as follows:

$$
\begin{aligned}
& u^{(0)} \ \text{arbitrary}, \qquad p^{(0)} = r^{(0)} = b - Au^{(0)}, \\
& \lambda_n = \frac{(r^{(n)}, r^{(n)})}{(p^{(n)}, Ap^{(n)})}, \\
& u^{(n+1)} = u^{(n)} + \lambda_n p^{(n)}, \qquad r^{(n+1)} = r^{(n)} - \lambda_n Ap^{(n)}, \\
& \alpha_n = \frac{(r^{(n+1)}, r^{(n+1)})}{(r^{(n)}, r^{(n)})}, \qquad p^{(n+1)} = r^{(n+1)} + \alpha_n p^{(n)}, \qquad n = 0, 1, \cdots.
\end{aligned}
$$

The data dependencies here are severe. One cannot
generate $p^{(n)}$ until $\alpha_{n-1}$ and
$\lambda_{n-1}$ are known. But these quantities
involve inner products dependent on $r^{(n-1)}$. As
pointed out above, an inner product on vectors of
length N requires time c*log(N). Thus it would
seem that a CG iteration could not be done faster
than in time c*log(N).

Idea of New Algorithm 


This natural seeming idea, that a CG
iteration on vectors of length N cannot be done
faster than in time c*log(N), turns out to be
incorrect. To see why, consider the computation
of a typical inner product required,

$$(r^{(n)}, r^{(n)}).$$

By the formulas above, $r^{(n)}$ is given as

$$r^{(n)} = r^{(n-1)} - \lambda_{n-1} Ap^{(n-1)}.$$

Now suppose we know $r^{(n-1)}$ and $p^{(n-1)}$ but
not the value of $\lambda_{n-1}$. In this case we
would be unable to evaluate $(r^{(n)}, r^{(n)})$, but
we could still perform most of the work involved in
evaluating this inner product. Specifically, we
can write the recurrence

$$(r^{(n)}, r^{(n)}) = (r^{(n-1)}, r^{(n-1)}) - 2\lambda_{n-1}\,(r^{(n-1)}, Ap^{(n-1)}) + \lambda_{n-1}^2\,(Ap^{(n-1)}, Ap^{(n-1)})$$

and can proceed to evaluate all inner products on
the right here. If subsequently someone told us
the value of $\lambda_{n-1}$ we could compute the
value of $(r^{(n)}, r^{(n)})$ very rapidly, since only
a few more real operations would then be needed to
complete evaluation of the recurrence relation.
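
The recurrence can be checked on a small example: the three inner products on the right are formed before the step length is known, and the new residual norm then follows from a few scalar operations. The sketch below is illustrative only; the matrix, the vectors and the function names are assumptions, and `lam` stands for the parameter written $\lambda_{n-1}$ above.

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [dot(row, x) for row in A]


def rr_from_parts(rr_prev, r_ap, ap_ap, lam):
    """(r_new, r_new) from inner products formed before lam was known."""
    return rr_prev - 2.0 * lam * r_ap + lam * lam * ap_ap


# Hypothetical symmetric positive definite matrix and previous iterates.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
r_prev = [1.0, -2.0, 0.5]
p_prev = [0.5, 1.0, -1.0]

# Work that can start as soon as r_prev and p_prev exist.
Ap = matvec(A, p_prev)
rr_prev, r_ap, ap_ap = dot(r_prev, r_prev), dot(r_prev, Ap), dot(Ap, Ap)

# Once the step length arrives, only scalar work remains.
lam = rr_prev / dot(p_prev, Ap)
fast = rr_from_parts(rr_prev, r_ap, ap_ap, lam)

# Direct evaluation for comparison.
r_new = [r - lam * q for r, q in zip(r_prev, Ap)]
direct = dot(r_new, r_new)
```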


It is easy to see how this idea can be used
to speed the computation of the CG algorithm on
parallel computers. We have replaced an inner
product computation requiring data not present
until iteration n with inner products of vectors
present at iteration n-1. Since these vectors
are present sooner, we have that much longer to
perform their inner products, to achieve the same
parallel computation speed. Stated differently,
assuming only the inner products limit the speed
of the computation, the use of this recurrence
relation for $(r^{(n)}, r^{(n)})$ and the analogous
relation for $(p^{(n)}, Ap^{(n)})$ will approximately
double the parallel speed of CG iteration, where
it is assumed that sufficiently many processors
are available, and communications cost can be
neglected.


Recurrence Relations 


The recurrence relation just described is one
of a large class of such relations which can be
exploited to speed up CG iteration. These
relations will be given in detail in Van Rosendale
[1983], but for now we consider only the general
form of such recurrence relations. Consider the
typical inner product:

$$(r^{(n)}, r^{(n)}).$$

The value of this inner product may be given in
terms of the values of inner products of vectors
occurring at any previous iteration together with
the values of the real parameters

$$\{\alpha_{n-1}, \alpha_{n-2}, \cdots;\ \lambda_{n-1}, \lambda_{n-2}, \cdots\}.$$

For example, for any k > 0, one can derive
recurrence relations of the form

$$(r^{(n)}, r^{(n)}) = \sum_{i=0}^{2k} a_i\,(r^{(n-k)}, A^i r^{(n-k)}) + \sum_{i=0}^{2k} b_i\,(r^{(n-k)}, A^i p^{(n-k)}) + \sum_{i=0}^{2k} c_i\,(p^{(n-k)}, A^i p^{(n-k)}). \qquad (*)$$

The coefficients $\{a_i, b_i, c_i\}$ occurring here
are polynomials in the parameters

$$\{\alpha_{n-1}, \cdots, \alpha_{n-k};\ \lambda_{n-1}, \cdots, \lambda_{n-k}\}.$$

Similar recurrence relations are available for the
other type of inner product occurring in CG
iteration, $(p^{(n)}, Ap^{(n)})$.


Figure 1. Principal Data Movement in New CG Algorithm.


New Algorithm 


To construct a more parallel variant of CG
iteration based on these recurrence relations, one
begins by selecting a value for the constant k,
which may be thought of as a look-ahead
parameter. Then at iteration n - k, when vectors
$r^{(n-k)}$ and $p^{(n-k)}$ become available one begins
forming all of the inner products

$$(r^{(n-k)}, A^i r^{(n-k)}), \quad i = 0, 1, \cdots, 2k,$$
$$(r^{(n-k)}, A^i p^{(n-k)}), \quad i = 0, 1, \cdots, 2k,$$
$$(p^{(n-k)}, A^i p^{(n-k)}), \quad i = 0, 1, \cdots, 2k.$$

The values of these inner products are needed in
the recurrence relations for the inner products

$$(r^{(n)}, r^{(n)}), \quad (p^{(n)}, Ap^{(n)})$$

at iteration n. Thus we arrive at an algorithm
whose data movements are sketched in Figure 1.


Clearly the problem of the delays caused by
the summations in the inner products is now
solved. If we choose k = log(N), the inner
product summation delays will have no impact on
algorithm speed. However, two new issues now
arise. First, we have not dealt with the way in
which the parameters

$$\{\alpha_{n-1}, \cdots, \alpha_{n-k};\ \lambda_{n-1}, \cdots, \lambda_{n-k}\}$$

enter into the recurrence relations. In
principle, there could be severe data dependencies
here. Second, there seem to be a large number of
inner products required now, most involving a
relatively high power of the matrix A.


Neither of these problems is as serious as it
first appears. For the first, it turns out the
coefficients $\{a_i, b_i, c_i\}$ in the recurrence
relations above are polynomials in the parameters

$$\{\alpha_{n-1}, \cdots, \alpha_{n-k};\ \lambda_{n-1}, \cdots, \lambda_{n-k}\}$$

which are at most quadratic in each parameter
separately. This fact, coupled with the
observation that the parameters

$$\alpha_{n-k}, \alpha_{n-k+1}, \cdots;\ \lambda_{n-k}, \lambda_{n-k+1}, \cdots$$

gradually become available, enables us to
effectively perform the coefficient evaluations in
a pipelined fashion. Thus at iteration n, when
we need the inner product $(r^{(n)}, r^{(n)})$, we can
have the recurrence relation (*) completely
evaluated, except for performing these summations,
or the analogous summations in the recurrence for
$(p^{(n)}, Ap^{(n)})$. This requires parallel time
log(k) = log(log(N)).


The second problem mentioned above, the
occurrence of high powers of the matrix A in the
recurrence relation (*), can be resolved by the
use of additional recurrence relations. First,
observe that there is no need to compute powers of
the matrix A, since we have the recurrences:

$$A^i r^{(n)} = A^i r^{(n-1)} - \lambda_{n-1} A^{i+1} p^{(n-1)},$$
$$A^i p^{(n)} = A^i r^{(n)} + \alpha_{n-1} A^i p^{(n-1)}.$$

Thus the set of vectors $\{A^i r^{(n)}\}$ and
$\{A^i p^{(n)}\}$ can be updated with only one matrix
vector product.


Next observe that nearly all of the inner
products needed can also be obtained by
recurrences. We have

$$(r^{(n)}, A^i r^{(n)}) = (r^{(n-1)}, A^i r^{(n-1)}) - 2\lambda_{n-1}\,(r^{(n-1)}, A^{i+1} p^{(n-1)}) + \lambda_{n-1}^2\,(p^{(n-1)}, A^{i+2} p^{(n-1)})$$

and similar recurrences for the other types of
inner products occurring in relation (*). Given
the values of the inner products

$$\{(r^{(n)}, A^i r^{(n)})\}_{i=0}^{2k}, \quad \{(r^{(n)}, A^i p^{(n)})\}_{i=0}^{2k}, \quad \{(p^{(n)}, A^i p^{(n)})\}_{i=0}^{2k}$$

at iteration n, we can obtain nearly all of the
inner products needed at iteration n+1. Only two
inner products need to be computed directly.
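
The inner product recurrence with powers of A relies on the symmetry of A, and can be checked numerically for several values of i. The sketch below is a hedged illustration: the matrix, vectors, step length and helper names are all hypothetical, and the residual update r_new = r_prev - lam * A p_prev is the one used throughout the text.

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [dot(row, x) for row in A]


def apply_power(A, x, i):
    """Return A^i x by repeated matrix-vector products."""
    for _ in range(i):
        x = matvec(A, x)
    return x


# Hypothetical symmetric matrix, previous iterates and step length.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
r_prev = [1.0, -2.0, 0.5]
p_prev = [0.5, 1.0, -1.0]
lam = 0.3
r_new = [r - lam * q for r, q in zip(r_prev, matvec(A, p_prev))]


def lhs(i):
    """(r_new, A^i r_new), evaluated directly."""
    return dot(r_new, apply_power(A, r_new, i))


def rhs(i):
    """The same quantity from inner products of the previous iterates."""
    return (dot(r_prev, apply_power(A, r_prev, i))
            - 2.0 * lam * dot(r_prev, apply_power(A, p_prev, i + 1))
            + lam * lam * dot(p_prev, apply_power(A, p_prev, i + 2)))
```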


Computational Complexity 


As pointed out above, the summations in the
recurrence relations (*) require time
log(k) = log(log(N)). Thus if matrix A has at
most d nonzeroes per row or column, this algorithm
requires parallel time

max(log(d), log(log(N))).

The sequential complexity of this algorithm is
essentially the same as that of the usual CG
algorithm; we still need two inner products and a
matrix vector product at every iteration.


ABSTRACT


In our former paper [6] a parallel and pipelined
fast matrix equation solver employing the conventional
Gauss Jordan Elimination Method in GF(2) has been
proposed where the elements are 0s or 1s. In this paper
two new solvers employing Cramer with Chio method [2]
are proposed which are more suitable for VLSI
implementation. The new solvers have a much more flexible
expandability toward the increase of the matrix size.
O(n’) gates are required for realizing an n-pyramid 
solver through solving 

AX = b
where A is a regular matrix of nxn, X and b are vectors
respectively. The solvers can be applied to the
real-time decryption [1][3][4][5], hashing, error
correction/detection [1], and so on.


1. INTRODUCTION 


In our former paper [6] an ultra high speed solver 
for the regular matrix equations in GF(2)* has been 
proposed where the elements are Os or 1s. The parallel 
and pipelined solver composed of the iterative logic 
circuits which are suitable for VLSI implementation can 
be realized by employing the conventional Gauss Jordan 
Elimination Method [6]. However, the solver has a 
drawback on flexibility in expanding hardware logic 
circuits according to the increase of the matrix size 
[6]. 

In this paper two new solvers employing Cramer 
with Chio Method are proposed which can overcome the 
drawback. The design of the new solvers is discussed 
from the viewpoints of the total number of gates and 
that of gate stages for computation. The proposed 
solvers can be applied to the decryption of encrypted 
codes [3], and hashing, and so on. 

In order to deerypt a polynomial from multiresidue 
polynomials in GF(2), a regular matrix equation 

AX =hb 
has to be solved where A is a regular matrix of nxn, X 
and b are vectors respectively. In the next section an 
example of Cramer with Chio method is described. 


* Note that GF(2) means Galois Fields 2. 
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2. CRAMER WITH CHIO METHOD 


The Cramer with Chio method [2] can be used for solving
the regular matrix equation

AX = b

where A is a regular n x n matrix, X is the vector

X = (x1 x2 x3 ... xn)^t

and b is the vector

b = (b1 b2 b3 ... bn)^t.

It is generally believed that the Cramer scheme is
unsuitable for large matrices because the number of
computations becomes too large. When the Cramer with
Chio method in GF(2) is employed for solving

AX = b,

the Cramer scheme becomes attractive because the number
of computations is drastically reduced by the Chio
method, from O(n!) to O(n^3), and because the basic
operations become simple in GF(2): multiplication and
addition correspond to the AND and Exclusive OR
functions, respectively.

A simple example of Cramer with Chio method is 
shown as follows: 


< example 1 > 
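Find X = (x1 x2 x3)^t where

\[
\begin{pmatrix} 0 & 1 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}
\]

Solution: since A is regular, |A| = 1 in GF(2), so each
x_i is the determinant of A with its ith column replaced
by b.

\[
x_1 =
\begin{vmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ 1 & 1 & 0 \end{vmatrix}
=
\begin{vmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{vmatrix}
=
\begin{vmatrix} 1 & 0 \\ 0 & 1 \end{vmatrix}
= 1
\]

\[
x_2 =
\begin{vmatrix} 0 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{vmatrix}
\overset{*}{=}
\begin{vmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{vmatrix}
=
\begin{vmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{vmatrix}
=
\begin{vmatrix} 1 & 0 \\ 0 & 1 \end{vmatrix}
= 1
\]

\[
x_3 =
\begin{vmatrix} 0 & 1 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{vmatrix}
\overset{*}{=}
\begin{vmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{vmatrix}
=
\begin{vmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 0 \end{vmatrix}
=
\begin{vmatrix} 1 & 1 \\ 0 & 0 \end{vmatrix}
= 0
\]

* column exchange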
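
The steps of example 1 can be reproduced in software. The following is a minimal sketch, not the authors' hardware design: the function names `det_gf2_chio` and `solve_gf2_cramer` are hypothetical, and the pivot search along the top row is an assumption modelled on the 1-detector and column exchange described in this paper. In GF(2) a column exchange does not change the determinant (since -1 = 1), and with a pivot of 1 the Chio condensation step reduces to b_ij = a_ij XOR (a_i1 AND a_1j).

```python
def det_gf2_chio(matrix):
    """Determinant over GF(2) by Chio condensation with column exchange."""
    m = [list(row) for row in matrix]
    while len(m) > 1:
        n = len(m)
        # "1-detector": find the leftmost 1 in the top row.
        if 1 not in m[0]:
            return 0  # an all-zero row means the determinant is 0
        j = m[0].index(1)
        if j != 0:
            # Column exchange; the sign change is irrelevant in GF(2).
            for row in m:
                row[0], row[j] = row[j], row[0]
        # Chio condensation with pivot m[0][0] == 1:
        # multiplication is AND, addition/subtraction is XOR.
        m = [[m[i][c] ^ (m[i][0] & m[0][c]) for c in range(1, n)]
             for i in range(1, n)]
    return m[0][0]


def solve_gf2_cramer(a, b):
    """Solve AX = b over GF(2) by Cramer's rule (A must be regular)."""
    if det_gf2_chio(a) != 1:
        raise ValueError("A is not regular over GF(2)")
    x = []
    for i in range(len(a)):
        # Replace the ith column of A by b; |A| = 1, so no division is needed.
        a_i = [list(row[:i]) + [b[r]] + list(row[i + 1:])
               for r, row in enumerate(a)]
        x.append(det_gf2_chio(a_i))
    return x


A = [[0, 1, 1],
     [1, 1, 0],
     [0, 1, 0]]
b = [1, 0, 1]
solution = solve_gf2_cramer(A, b)  # the system of example 1
```

Each call to `det_gf2_chio` plays the role of one pyramid: the matrix shrinks by one row and one column per condensation step until a single bit remains.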


3. 3-DIMENSIONAL MATRIX EQUATION SOLVER 


A 3-dimensional matrix equation solver employing the
Cramer with Chio method is proposed in this section.
The 3-dimensional parallel and pipelined solver for
solving

AX = b

where A is a regular n x n matrix, is composed of n
pyramids as shown in Fig.1. A pyramid, which produces
one element of vector X, consists of n-stage pipes. A
single pipe of the ith stage, shown in Fig.2, is composed
of a panel of k x k D Flip Flops where k is n-i+1, a
1-detector, a column exchanger, and a logic unit, shown
in Fig.3-6 respectively. The 1-detector in the ith stage
detects the leftmost element equal to 1 and produces the
row number of that element, which is used for exchanging
the leftmost column with the column of that number. The
row number corresponds to the output of case i in Fig.4.
The column exchanger exchanges the required two
columns. The logic unit described in Fig.6 performs
the arithmetic operations in GF(2). In GF(2) addition
and subtraction correspond to the Exclusive OR function
and multiplication to the AND function. Division
corresponds to no-operation when the divisor is 1.
In order to find x_i, where

\[
x_i =
\begin{vmatrix}
a_{11} & a_{12} & \cdots & b_1 & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & b_2 & \cdots & a_{2n} \\
\vdots & \vdots &        & \vdots &      & \vdots \\
a_{n1} & a_{n2} & \cdots & b_n & \cdots & a_{nn}
\end{vmatrix}
\qquad \text{(b in the ith column)}
\]

the required data flows from bottom to top through a
pyramid and x_i is obtained as shown in Fig.1.
The proposed solver is suitable for VLSI implementation
because of the high regularity of its iterative logic
circuits. The solver can be expanded flexibly as the
matrix size increases. The solver for a regular matrix
equation

AX = b

where A is a regular (n+1) x (n+1) matrix, can be
easily realized by appending the (n+1)th pipe to the
bottom of a pyramid composed of n-stage pipes.

The total number of gates required for realizing a
matrix solver to solve

AX = b

where A is a regular n x n matrix, is shown in Table 1
and Fig.7. The number of required gate stages for
solving a regular matrix equation is also shown in
Fig.7. O(n^4) gates are required for realizing an
n-pyramid matrix solver, where n is the size of
vector X.


Fig.1 Cramer with Chio method


Fig.2 A single pipe of the ith stage (k = n-i+1)


Fig.4 A 1-detector of the ith stage (k = n-i+1; clocked;
outputs case 1, case 2, case 3, ..., case k)


Fig.6 A logic unit of the ith stage


Table 1 The number of required gates
(Row labels: COLUMN EXCHANGER, LOGIC UNIT, TOTAL,
AMOUNT OF WHOLE STAGES. Legible TOTAL entries for the
ith stage: 4i^2 - (8n+5)i + (4n^2+5n+1) and
2i^2 - (4n+3)i + (2n^2+3n+1).)


Fig.7 Evaluation of the matrix solvers
(axis: the number of required gates)


Fig.9 Attached hardware for data iteration


4. RECURRING MATRIX SOLVER


A simplified recurring matrix solver is proposed
in this section. The recurring solver for solving

AX = b

where A is a regular n x n matrix, is composed of a
single pipe together with the attached hardware shown in
Fig.8. Once the required data is set, the elements
circulate from bottom to top while the attached
hardware synchronously masks the elements that are not
needed for the calculation. An example of masking is
shown in Fig.10. The attached hardware is composed of
an (n-1)-stage D Flip Flop circuit and AND-logic
circuits, shown in Fig.9. The initial seed of the
D Flip Flop circuit is

(Q1 Q2 Q3 ... Q(n-1)) = (0 1 1 ... 1),

and in the next clock

(Q1 Q2 Q3 ... Q(n-1)) = (0 0 1 ... 1).
The recurring matrix solver can drastically reduce the 
amount of required hardware for solving a regular ma- 
trix equation as shown in Fig.7. O(n* ) gates are re- 
quired for realizing a recurring matrix solver where n 
is the size of vector X. The number of required gate- 
stage is also shown in Fig.7. 
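
The growth rate quoted for the n-pyramid solver can be checked numerically. The sketch below assumes that the first TOTAL entry legible in Table 1, 4i^2 - (8n+5)i + (4n^2+5n+1), is the gate count of the ith stage of one pyramid; the column headings of the table did not survive, so this reading is an assumption, and the function names are hypothetical. With k = n-i+1 the expression equals 4k^2 - 3k, so summing over the n stages gives a cubic count per pyramid and a quartic count for n pyramids.

```python
def stage_gates(n, i):
    """Gates of the ith stage (assumed reading of the Table 1 TOTAL entry)."""
    return 4 * i * i - (8 * n + 5) * i + (4 * n * n + 5 * n + 1)


def pyramid_gates(n):
    """Gates of one pyramid: the sum over stages i = 1..n."""
    return sum(stage_gates(n, i) for i in range(1, n + 1))


def n_pyramid_gates(n):
    """Gates of the whole n-pyramid solver (n identical pyramids)."""
    return n * pyramid_gates(n)


def same_as_k_form(n):
    """Check that the Table 1 expression equals 4k^2 - 3k with k = n-i+1."""
    return all(stage_gates(n, i) == 4 * (n - i + 1) ** 2 - 3 * (n - i + 1)
               for i in range(1, n + 1))


# Doubling n should multiply an O(n^4) count by roughly 2^4 = 16.
growth = n_pyramid_gates(200) / n_pyramid_gates(100)
```

The count for the recurring solver is not modelled here, because the gate count of its attached hardware is given only graphically in Fig.7.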


5. CONCLUSION 


Matrix equations in GF(2) must be solved in
various application fields [6]. Two new matrix equation
solvers in GF(2) employing the Cramer with Chio method
were proposed. One is an n-pyramid solver, where n
is the size of vector X; a pyramid is composed of
n-stage pipes. The other is a recurring matrix solver
employing only a single pipe in order to save hardware.
The total number of gates required for realizing
the solvers and the number of gate stages were discussed.
The proposed solvers can be implemented by 3-dimensional
VLSI circuits because of the high regularity
of the iterative logic. The solvers can be expanded
flexibly as the matrix size increases.


Fig.10 An example of masking


ABSTRACT


Four distinct packet communication requirements for
network architectured computer systems are: system
control, dataflow data type movement, SIMD data
realignment, and movement of high volume data between
MIMD configurations when memory sharing is unavailable
or too costly. This paper defines and describes a packet
switching mechanism which meets each of these
requirements. Mechanisms are also defined and
described for breaking and restoring SIMD
execution structures, which are required to
complete the implementation of packet switching
for SIMD execution. The mechanisms were defined
and are described in the context of the Texas
Reconfigurable Array Computer (TRAC), but should
be in large measure adaptable to other network
architectured systems.


1.0 PROBLEM STATEMENT AND OVERVIEW 


A computer system incorporating a multistage
network can utilize either packet switching or
circuit switching or both. The choice of modes
may depend upon other architectural factors, such
as whether the network couples processing
elements to memories or processors to processors,
or whether the system model of computation is
SIMD or MIMD, and also may depend upon the
selection of problems the system is intended to
execute.


The Texas Reconfigurable Array Computer
(TRAC) [SEJ80] uses its network to create
configurations by coupling processing elements to
memories to form processors and then coupling
processors to form SIMD tasks, if desired
[BRO82]. Tasks, whether SISD or SIMD, run
independently from each other in MIMD fashion.
TRAC is thus capable of SIMD and MIMD models of
computation. The varistructure capabilities
[LIP77] of the architecture are implemented
through carry look-ahead techniques which are
similar to SIMD synchronization techniques. The
breadth of representation capability of TRAC has
forced us into a thorough analysis of the
requirements of network architectured systems for
the packet switching mode of network
communication. This paper reports the results of
this study.


There are four distinct communication
requirements for network architectured systems
which can be met by appropriate packet switching
facilities. These are:

1. system control (both inter- and
   intra-task communication)

2. data movement for dataflow models of
   computation (inter-task communication)

3. data realignment for SIMD processing of
   arrays (intra-task communication)
   and

4. movement of data between MIMD
   configurations when either circuit
   switching is not provided, or when the
   switch configuration required for
   establishment of circuits, needed to
   link specific process/data pairs, cannot
   be realized or realized only with
   excessive reorganization cost
   (inter-task communication)


All of these communication requirements have 
been previously recognized and discussed in the 
literature. This paper defines a coherent 
integrated implementation of packet switching 
which serves all of these requirements. Some of 
the mechanisms and implementation schemes have 
been previously reported [TRI79,PRE79]. The 
overlap with this integrated discussion will be 
noted in the text at the appropriate places. 


Commonly, packets will be sent to some
subset of the processors in an SIMD task. The
selection of packet switching as a mode of
communication thus requires an efficient
mechanism for desynchronizing and resynchronizing
SIMD tasks. Two such mechanisms are implemented
as a part of the communication functionality for
TRAC. This capability is particularly
significant for TRAC since tasks of data width
greater than a single byte may be constructed
through SIMD synchronization of single byte wide
processors. However, both should be adaptable to
other architectures.


The next section states the requirements in
the context of the TRAC architecture. Section 3
gives an overview of the packet switching
capabilities and their implementation. Section 4
defines the desynchronizing and resynchronizing
functions and their implementation. Section 5
describes the implementation scheme including
software requirements. Section 6 analyzes the
design space for implementation and justifies the
selections.


2.0 REQUIREMENTS FOR PACKET SWITCHING FUNCTIONALITY
IN TRAC


Although TRAC has been previously described
[SEJ80,PRE79], we review the background for packet
communication. We emphasize that software design
and application studies have been conducted in
parallel with, and sometimes in advance of, definition
and implementation of architectural features.
The integrated packet communication facilities
are an example of the interplay of requirements
and architecture.

TRAC uses circuit switching to establish
task spaces (partitions) of resources conforming
to some desired model of computation. The tasks
may be SIMD or SISD. Partitions run
independently of each other in MIMD fashion. A
circuit switching mode of communication and data
movement is available within a task space.
Circuits are used as broadcast buses for 
instructions and carry linkage. They are also 
used to implement explicit sharing of memory 
between processors, since any one of the 
processors can explicitly detach a memory, and 
another can explicitly select this memory in a 
single instruction. Reconfiguration of the 
network can also be used to move memory units and 
thus data between configurations. There would, 
however, be many restrictions on possible usage 
of the system if these circuit switch functions 
were all that were available for interprocess 
communication. System control functions could be
executed only by extensive reconfiguration. It
would be awkward to implement the dataflow model
of computation for more than a small number of
processors (the number which can effectively
communicate through a single switchable memory
unit). Data realignment between phases of SIMD
computations (such as transposing a matrix) would
have to be serialized. Network blocking would
restrict the set of configurations which could
realize sharing among configurations and be
useful for MIMD processing. Packets, on the other
hand, provide complete processor-processor
communication capability. The disadvantage of the
packet mechanism is that packets arrive at the
receiving process at a non-deterministic time,
thereby forcing a set of receiving processors out
of synchronism and thus requiring resynchronization
overhead.


Each of these functions requires somewhat
different capabilities from the packet
communication mechanism. System control requires
small interrupting packets. Dataflow and
realignment require carrying of addresses
together with data. General data movement
between MIMD configurations requires efficient
transmission of large volumes of data.


The mechanisms described in succeeding 
sections integrate implementation of all of these 
requirements. The succeeding discussion
classifies the mechanisms as inter- or intra-task
depending on their functions.


3.0 OVERVIEW OF TRAC'S PACKET SUPPORT 


On TRAC each task is assigned its own task
space of processors, memories and I/O devices
that are interconnected as required. One
processor of each task space is called its
task-head. This processor is chosen at task
set up time and implements the required
synchronization, security and authorization
mechanisms and other operating system functions
needed to support inter/intra task packet
communication. When any task executes, the
processors assigned to it execute in SIMD mode
(in lock step). Therefore the processors
assigned to this task are simultaneously informed
about any intra-task communication. For any
inter-task packet communication, however, a sending
task needs to inform the receiving task of its desire
to communicate. This is done by the use of an
inter-task protocol implemented strictly through
interrupting packets (described below).


On TRAC, packets can be transmitted within a
task space and between task spaces. The packet
network provides full interconnection between the
w processors in the system. The SW-banyan
network provides this communication using fewer
than w^2 links; this implies that links must be
shared. Transmission delays dependent on traffic
patterns may be introduced by blocking in the
network.


The addition of a packet switching (store
and forward) network increases the switch cost
only marginally. This is due to the fact that
the switch can be time multiplexed. The time
slice devoted to packet forwarding on the switch
corresponds to the period where primary memory is
executing a read or a write cycle (i.e., when it
cannot receive data nor send data on the circuit
switch "bus"). The extra hardware needed
consists of some control circuitry and buffers
to provide the store and forward function for the
packet network. Thus this hardware requires some
additional logic (chip area), which is relatively
inexpensive. Further, it does not require any
additional pins, since the pins already used for
circuit switching are data multiplexed.


A packet consists of an address header and a 
number of data words. The addressing scheme for 
directing the packet movement in the banyan 
network is the one suggested by Tripathi and 
Lipovski [TRI79]. An improved method has been 
suggested by Siegel and McMillen [SIE81]. A 
processor Pi transmits a packet to another
processor Pj by first loading relevant data into 
a packet generating unit, in a designated primary 
memory module in its private memory ensemble. As 
links and nodes become available the packet 
proceeds towards the target processor along a 
unique path. Some form of arbitration is 
provided at the switch nodes to resolve any 
packet conflicts. The details of this store and 
forward hardware on TRAC are given in [SEJ81]. 


The inter- and intra-task packet
communication use this packet network. The
primary difference between intra- and inter-task
packets is the amount of context switching
required for their reception at a target
processor. Packets which cause interrupts upon
reception are called interrupting packets, while
packets which are explicitly sent and explicitly
received are called mapping packets.


Interrupting packets are used to transmit
small amounts of control information. Mapping
packets are used when movement of substantial
amounts of data is required, but a small amount
of controlling data movement is needed to
initiate the activity. These two types of
packets use the packet network at different times
(i.e., the network is time-multiplexed).
Therefore, in effect we have an interrupting
packet channel and a mapping packet channel.


Inter-task mapping packet communication is
initiated by following a "request-acknowledge"
protocol. This protocol requires that the
sending task's task head send an interrupting
packet to the receiving task's task head. This
packet indicates the desire for multi-byte
wide data communication between these two tasks
(since there are multiple processors in either
task). On receiving such an interrupting packet,
the receiving task head is interrupted along with
all other processors of the receiving task. This
task head then reviews and validates the
interrupting packet and sends an "acknowledge"
packet to the requesting task head. It then
informs the other processors in the receiving task to
execute a "receive MAP" instruction. The sending
task's task head receives the acknowledge and
instructs its task's processors to execute a
"send MAP" instruction. These "MAP" instructions
are executed by microcode in the processors.
Both instructions have a parameter, which is the
number of bytes to be sent or received. After
the multi-byte wide data communication is
initiated, the processors of the sending and
receiving tasks operate at their own pace. A
processor sends a packet whenever its packet
input port is free, and/or receives one when it
arrives. This communication is terminated when
the send/receive counts in the respective tasks'
processors reach zero. Resynchronization to SIMD
mode is needed because each processor can finish
its MAP instruction at a different time (due to
blocking in the network). This synchronization
of the processors is achieved by mechanisms
defined later.
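
The handshake and the resulting loss of lock step can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a model of the TRAC microcode: the class `TaskHead`, its `allowed_senders` authorization table, and the round-robin arbitration with `links_per_tick` shared links are all hypothetical stand-ins for details the text leaves to [SEJ81].

```python
from dataclasses import dataclass, field


@dataclass
class TaskHead:
    name: str
    allowed_senders: set = field(default_factory=set)  # hypothetical table
    log: list = field(default_factory=list)

    def on_request(self, sender):
        """Handle an interrupting request packet; return True if acknowledged."""
        self.log.append(f"interrupt: request from {sender}")
        if sender not in self.allowed_senders:
            self.log.append("request rejected")
            return False
        self.log.append(f"acknowledge -> {sender}")
        self.log.append("task processors: receive MAP")
        return True


def start_inter_task_map(sender, receiver):
    """Request-acknowledge handshake between two task heads."""
    sender.log.append(f"request -> {receiver.name}")
    if not receiver.on_request(sender.name):
        return False
    sender.log.append("acknowledge received")
    sender.log.append("task processors: send MAP")
    return True


def map_finish_ticks(n_procs, packets_per_proc, links_per_tick):
    """Tick at which each sending processor's send count reaches zero.

    Blocking is modelled (as an assumption) by granting at most
    links_per_tick packets per tick in rotating priority order.
    """
    if links_per_tick < 1:
        raise ValueError("links_per_tick must be at least 1")
    remaining = [packets_per_proc] * n_procs
    finish = [0] * n_procs
    tick = 0
    first = 0
    while any(remaining):
        tick += 1
        granted = 0
        for offset in range(n_procs):
            p = (first + offset) % n_procs
            if remaining[p] and granted < links_per_tick:
                remaining[p] -= 1
                granted += 1
                if remaining[p] == 0:
                    finish[p] = tick
        first = (first + 1) % n_procs
    return finish


a = TaskHead("A")
c = TaskHead("C", allowed_senders={"A"})
accepted = start_inter_task_map(a, c)
ticks = map_finish_ticks(n_procs=3, packets_per_proc=2, links_per_tick=2)
```

Because the processors finish at different ticks, they leave the MAP instruction out of lock step, which is why a resynchronization mechanism is needed afterwards.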


Intra-task mapping packet communication does
not require this "request-acknowledge" protocol.
This is because the processors of the task are
operating in lock step and therefore initiate
this communication together. This is done by
initiating the execution of a "send MAP/receive
MAP" instruction. This instruction handles the
required sending and receiving of mapping
packets. It executes in a similar manner to the
"send/receive MAP" instructions described above.


To support dataflow implementation on TRAC,
we might need multiple tasks to send to a single
receiving task simultaneously. Since we cannot
ensure that these sending tasks will enter the
MAP function at the same time, we need to
implement the MAP instructions so that they can
be interrupted by another sending task wishing to
join the MAP function. This allows the sending
tasks to join the MAP function "asynchronously"
(they still have to follow the
"request-acknowledge" protocol described
above). The advantage gained by this scheme
is that the sending tasks are made to wait for
the least amount of time (the time required for a
transmit acknowledge). The two other schemes
are: 1) to completely serialize the senders in
the request order, and 2) to enable transmission
(from all the senders) only after the last sender
has requested transmit permission. The
disadvantage of the first scheme is that it
does not allow parallel communication and good
dataflow implementation. The disadvantage of the
second alternative scheme is that new senders
have to wait upon the last sender. Another
possible problem with this scheme is that the
senders may have to wait indefinitely if any one
of them does not join the MAP request.


TRAC packets are in fact trains of 1 byte
wide words and 8 bytes long. There are two basic
formats for mapping packets. These formats
(Fig. 1 and 2) allow us to communicate
information for all the applications cited above.
In the first format (called an address/data
packet format) the memory address in the
destination processor, where the data will be
stored, always accompanies the data. Here we can
either send one byte of data (Fig. 1(a)), or two
bytes of data per packet (Fig. 1(b)). If we send
only one byte of data per packet, then the
remaining three bytes in the packet can contain
any security/authorization information if needed.
In the second format (Fig. 2) the packet contains
6 bytes of data/control information. The
microcode in the receiving task implicitly knows
the destination address for placing the packet
data, or it is informed of this address prior to
receiving this information. The microcode need
not be aware of the contents of the packet data.
This information is analyzed by the local higher
level software construct. The format of Fig. 2
is also used for interrupting packets.


(In each format below, rows are listed top to bottom,
one byte per row; the first byte is split into
MSB 4 bits | LSB 4 bits.)

(a) Address-data packet format type-1

    Pkt. Typ. Info. | Dest. Proc. ID
    Dest. Addr. High byte
    Dest. Addr. Low byte
    Data byte
    (Security/Authorization Information)

(b) Address-data packet format type-2

    Pkt. Typ. Info. | Dest. Proc. ID
    1st Dest. Addr. High byte
    1st Dest. Addr. Low byte
    1st Data byte
    2nd Dest. Addr. High byte
    2nd Dest. Addr. Low byte
    2nd Data byte
    (Security/Authorization Information)

Figure 1: Address-Data Packet Formats


    Pkt. Typ. Info. | Dest. Proc. ID
    Data/Control Information

Figure 2: Data/Control Information
Packet Format

Both packet formats need to be usable
simultaneously for mapping packets on
TRAC. The address-data format of Fig. 1 looks
attractive for supporting dataflow algorithms and
data realignment in SIMD tasks, while the
data/control packet format of Fig. 2 may be more
suitable for transferring "large" amounts of data
between tasks. Looking at the two formats we see
that the formats in Fig. 1 are special cases of
the format in Fig. 2. The two formats must
nevertheless be distinguished because they require
different microcode (or hardware) for handling
them.


Some information will be needed in each
packet to indicate its format. The first byte of
a packet contains 4 bits of destination processor
ID, which is required to route the packet through
the banyan to one of the sixteen processors (in
the current configuration of TRAC). The other 4
bits of this byte carry the packet type
information, which indicates the packet's format
to the microcode. This also facilitates the
concurrent reception of packets of multiple types.
For example, suppose tasks A and B send
data simultaneously to task C using mapping
packets. Then it is possible that task A sends
packets of the type shown in Fig. 1(a) and task B
sends packets of the type shown in Fig. 2. The
microcode is thus required to recognize the
packet type for each packet received.
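
The first-byte layout described above can be sketched as follows. The placement of the packet type in the upper 4 bits and the destination processor ID in the lower 4 bits follows the column headings of Fig. 1; the numeric type codes, the function names, and the omission of any trailing security/authorization bytes are assumptions made only for illustration.

```python
# Hypothetical packet type codes; the paper does not list the actual values.
TYPE_ADDR_DATA_1 = 0x1   # Fig. 1(a)
TYPE_ADDR_DATA_2 = 0x2   # Fig. 1(b)
TYPE_DATA_CONTROL = 0x3  # Fig. 2


def pack_header(pkt_type, dest_id):
    """First byte: packet type in the MSB 4 bits, processor ID in the LSB 4 bits."""
    if not 0 <= pkt_type < 16 or not 0 <= dest_id < 16:
        raise ValueError("both fields must fit in 4 bits")
    return (pkt_type << 4) | dest_id


def unpack_header(first_byte):
    """Return (packet type, destination processor ID) from the first byte."""
    return first_byte >> 4, first_byte & 0x0F


def build_type1(dest_id, dest_addr, data):
    """Leading bytes of a type-1 address-data packet (no security bytes)."""
    if not 0 <= dest_addr < 0x10000 or not 0 <= data < 0x100:
        raise ValueError("address must fit in 16 bits and data in 8 bits")
    return bytes([
        pack_header(TYPE_ADDR_DATA_1, dest_id),
        dest_addr >> 8,      # Dest. Addr. High byte
        dest_addr & 0xFF,    # Dest. Addr. Low byte
        data,                # Data byte
    ])


packet = build_type1(dest_id=5, dest_addr=0x1234, data=0xFF)
pkt_type, dest = unpack_header(packet[0])
```

A receiver would dispatch on `pkt_type` to select the handler for the corresponding format, which is how packets of different types from different senders could be accepted concurrently.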


4.0 RESYNCHRONIZATION MECHANISMS 


The protocols for executing packet transfers
require the use of the desynchronizing and
resynchronizing mechanisms. The processors of a
SIMD task are desynchronized when a task is
broken into subtasks, or during exception
handling, or during the execution of an
instruction that requires the processors of a
task to operate independently (e.g., "MAP"
instruction execution). There are two
resynchronization mechanisms. One is intended to
reassemble the sub-tasks into the original task
structure when all the sub-tasks have terminated.
The second is designed for use when a single
processor needs to interrupt the synchronous
execution of a single task. This mechanism can
also be used to reassemble the original task
structure. Both of these resynchronization
mechanisms are supported in hardware.


4.1 Resynchronization Mechanism - A

This mechanism is used to synchronize the
sub-tasks into their original task when the
sub-tasks have terminated. It is used only when
the circuit switched path linking the processors
of the task (called the instruction tree) has
been broken to create the sub-tasks. The
instruction tree is a tree-shaped broadcast path,
rooted at a memory module, and can be recreated
by a single instruction. The logic elements of
the instruction tree of interest here are
illustrated in Figure 3.


After breaking a task into sub-tasks, the
task head processor places the count of the
number of processors to be resynchronized in the
synchronization register in the memory module at
the root of the instruction tree. After
finishing asynchronous processing each processor
acquires this memory module and decrements the
count by 1. Mutual exclusion for such an access
is provided by hardware in the interconnection
network. If the count is non-zero the processor
simply activates the part of the instruction tree
that links it to the memory module at the root.
It then executes an arithmetic operation so that
it asserts a true propagate, but does not produce
a generate in the carry-lookahead logic that is
part of the instruction tree (Fig. 3). The
processor then waits for the incoming carry to
become true. The last processor to acquire the
shared memory module sees a zero count in the
synchronization register. It creates its branch
of the instruction tree and then executes an
arithmetic operation to assert the generate
signal. All the processors of the original task
see a carry due to this operation. This
indicates that the original instruction tree has
been recreated and that the processors are
resynchronized.


Figure 3: Resynchronization Mechanism-1
(Only the carry-lookahead signals following
the instruction tree are shown here. Labels: memory
module at the root of the instruction tree; generate,
propagate and carry signals; switch modules; processor
modules connected by the instruction tree.)
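
In software terms this mechanism behaves like a counting barrier. The following sketch is an analogy only: threads stand in for processors, a lock stands in for the hardware mutual exclusion on the root memory module, and an event stands in for the carry signal seen by every processor. The class and function names are hypothetical.

```python
import threading


class CountdownResync:
    """Software analogue of resynchronization mechanism A."""

    def __init__(self, n_procs):
        self._count = n_procs            # synchronization register
        self._lock = threading.Lock()    # mutual exclusion on the memory module
        self._carry = threading.Event()  # carry observed by all processors

    def arrive(self):
        """Called by a processor when its asynchronous work is finished.

        Returns True for the last processor to arrive, False otherwise.
        """
        with self._lock:
            self._count -= 1
            last = self._count == 0
        if last:
            self._carry.set()    # last arrival asserts "generate"
        else:
            self._carry.wait()   # others assert "propagate" and wait for carry
        return last


def run_tasks(n_procs):
    """Start n_procs workers and report which one arrived last."""
    barrier = CountdownResync(n_procs)
    results = []
    results_lock = threading.Lock()

    def worker():
        was_last = barrier.arrive()
        with results_lock:
            results.append(was_last)

    threads = [threading.Thread(target=worker) for _ in range(n_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


flags = run_tasks(4)
```

Exactly one worker observes the zero count, and no worker returns from `arrive` before that point, which mirrors the behaviour of the carry in the instruction tree.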


4.2 Resynchronization Mechanism - B 


The following hardware mechanism is used
when a single processor wants to communicate an
asynchronous event to the other processors of the
desynchronized tasks. It can also be used to
bring the processors back into synchrony. This
mechanism can be used whether the instruction tree
is active or has been deactivated.


When an asynchronous event which has to be
communicated to the rest of the processors occurs
in any one of the processors of a task, the
processor recognizing the event asserts a
signal-A via a tree shaped single line (see
Figures 4 and 5). This signal goes down to the
root of the tree over line-A, gets turned around
and comes back up along the broadcast tree to all
the task's processors over line-B. This signal
is recorded in all the processors, including the
asserting processor, by setting the SYNC
flip-flop.

As each processor completes its current
"atomic" operation it tests this flip-flop. If
it is set, the processor recognizes that an
asynchronous event has occurred and that it needs
to get back into lock step with its companion task
processors. If the SYNC flip-flop is not set, it
continues to execute the next "atomic" operation
independently. On recognizing the occurrence of
an asynchronous event it clears the SYNC
flip-flop and waits for a wire-AND line (Fig. 5)
to become TRUE. This line connects all the
processors in a task.


While in an asynchronous mode of operation
the inverted output of the SYNC flip-flop is fed
onto this wire-AND line. Before entering the
asynchronous mode this flip-flop is cleared and
the wire-AND line is set TRUE. Whenever any of
the SYNC flip-flops attached to this wire-AND
line is set, this line is set FALSE. Thus, after
the asynchronous event has occurred and this
wire-AND line is TRUE, it is known that the
processors are once again in lock step. They
then proceed ahead in lock step to service the
event.
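
The behaviour of the SYNC flip-flops and the wire-AND line can be traced with a short deterministic simulation. This is an illustrative sketch, not the TRAC hardware: the class `Processor`, the `waiting` flag, and the fixed order in which processors finish their atomic operations are assumptions.

```python
class Processor:
    def __init__(self, name):
        self.name = name
        self.sync = False     # SYNC flip-flop
        self.waiting = False  # True once the processor waits on the wire-AND line


def wire_and(procs):
    """The line is TRUE only when every inverted SYNC output is TRUE."""
    return all(not p.sync for p in procs)


def broadcast_event(procs):
    """Signal travels down line-A and back up line-B, setting every flip-flop."""
    for p in procs:
        p.sync = True


def finish_atomic_operation(p):
    """At the end of an atomic operation, test the flip-flop and clear it if set."""
    if p.sync:
        p.sync = False
        p.waiting = True


def resync_trace(n_procs):
    """State of the wire-AND line before the event and after each step."""
    procs = [Processor(f"P{i}") for i in range(n_procs)]
    trace = [wire_and(procs)]        # before the event: TRUE
    broadcast_event(procs)
    trace.append(wire_and(procs))    # after the broadcast: FALSE
    for p in procs:                  # processors finish one after another
        finish_atomic_operation(p)
        trace.append(wire_and(procs))
    return trace


trace = resync_trace(3)
```

The line returns to TRUE only after the last processor has cleared its flip-flop, at which point every processor is waiting and the task can continue in lock step.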


Figure 4: Resynchronization Hardware Mechanism
(Labels: processor, SYNC flip-flop, asserted line-A,
root of broadcast tree.)


Figure 5: Wire-AND Logic


The "MAP" instruction's microcode uses part 
of the second resynchronization mechanism to 
resynchronize the original task's processors. An 
overview of the "MAP" function's microcode 
operations and the use of the resynchronization 
mechanism by a processor is given in Fig. 6. 


Further analysis of the resynchronization 
hardware is very much dependent on the processor 
implementation. The implementation in TRAC is 
described in a report [RAT83]. 


5.0 IMPLEMENTATION OF THE MAP FUNCTION 


In this section we discuss the operands,
data structures, operations and hardware
mechanisms required for the sender in a MAP
function. Here we state only the basic
requirements and indicate the sequence required
of the microcode to support such data
communication. Then we propose a communication
scheme to transmit the information required by
the receiver.

The two cases of multi-byte communication
are the intra- and inter-task data
communication/realignment. Their initiation and
execution sequence has been discussed in Section
3.


In the two cases of task communication
described above we found that we need a
resynchronization mechanism at the end of the MAP
function to bring the processors of the original
task back in lock step. The resynchronization
mechanism described in Section 4.2 is used for
this purpose.


\ 
Figure 6: Overview of the use of the
hardware resynchronization mechanism
to terminate the MAP function
(Flowchart boxes, top to bottom: MAP function
initiated; Set SYNC flip-flop; Transmit/receive a
packet; Decrement required counts; test "Send count = 0
and Receive count = 0"; Do the necessary operations and
initialization/updates; Set SYNC flip-flop; Get back in
lock step and continue task execution.)


5.1 THE MAP FUNCTION SENDER 

Regardless of the mapping packet format used
by any two communicating tasks, the microcode
requires the following minimum information for
implementing the MAP function:

1. Send Count: number of packets a
   processor will transmit.

2. Receive Count: number of packets a
   processor will receive.

3. Destination Processor IDs: these are
   required by the packet network to route
   the packet to the required processor.

4. A pointer to the memory of the sender
   task where the data to be transmitted is
   located.

5. A pointer to the memory of the
   destination processor where the data is
   to be stored. This may be a specified
   index register or an absolute address in
   the receiving task's space.

Some additional information may also be required
to allow the operating system to exercise its
data security and authorization mechanisms.

This information can be provided by one of
the following three schemes:

1. Load time binding: Here a compiler
   creates the template for the
   transformation or the data
   communication, and the loader binds the
   addresses, destination processor
   IDs, and the send and receive counts.
   This is done for all communicating
   tasks.

2. Dynamic binding: The interrupting
   packet and/or the mapping packets are
   used to communicate this information.

3. A combination of the above two schemes.

The first scheme is suitable for the
address-data packet format (Fig. 1), while the
second/third scheme seems more suitable for the
data/control information packet format (Fig. 2).


5.2 COMMUNICATION TO THE RECEIVER 


We assume here that the required send counts,
destination processor IDs and pointer to the data to be
sent have been specified to the sending task's
processors in some fashion. We are concerned here with
the specification of the packet receive counts for the
receiving task.


The receiving task receives an interrupting
"request" packet when a sender wants to
communicate with it. The task head recognizes
this packet, and after checking authorization and
security of this communication, it acknowledges
the sender appropriately. If mapping packets are
to be received it must inform its task's
processors of this. After the processors of the
receiving task are made aware of a sender, they
initialize their packet receive counts and then
start reading the MAP packets. It is possible
that while one sender is being serviced, other
senders may also desire communication. As each new
sender is allowed to communicate, all the
processors of the receiving task must update
their packet receive counts appropriately.
Therefore a scheme is needed to initialize and
update the packet receive counts properly.


Four schemes were considered. They differed
in whether the receiver would be given
information at load time, by means of
interrupting packets, or by means of the mapping
packet mechanism. They all required the sender
to have descriptors and templates set up at load
time, requiring the loader to execute two passes
to create and bind the operands for a MAP
function. A detailed analysis of these schemes
is available in a report [RAT83].

The scheme we have selected for
implementation assumes that the MAP function's
"receive operands" are bound at load time in the
sending task's space. When the sending task
desires to execute a MAP function it must
communicate the necessary "receive operands" to
the receiving task dynamically; that is, the
"receive operands" are sent during execution
time. These operands are sent over the mapping
packet channel.


The actual sequence of operations is as follows:

1. The sending task's task head sends a "request"
   interrupt packet to the receiving task's task head.

2. The receiving task head validates this request and
   sends an appropriate acknowledgement.

3. On receiving a "transmit-enable" acknowledge the
   sending task head directs its task's processors to
   start MAP packet communication.

4. Each sending task processor first sends a special
   "receive operand" mapping packet. This packet's type
   is set to indicate its contents. After sending this
   packet it continues to send the required data MAP
   packets.

5. Each receiving task's processor, on receiving a
   packet, checks the packet type. If it contains
   "receive operand" information it updates its receive
   count (maintained in its working registers) and
   destination address (if any) using the packet data.
   Otherwise on receiving a data packet it stores the
   data at the required location.
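
Step 5 of the sequence above can be illustrated with a minimal
sketch of the receiving processor's packet handling. This is not
the TRAC microcode; the names PacketType, MapPacket and
ReceivingProcessor are hypothetical, and the assumption that the
destination address advances by one after each data packet is made
only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class PacketType(Enum):
    RECEIVE_OPERAND = auto()  # carries a receive count and optional address
    DATA = auto()             # carries payload data


@dataclass
class MapPacket:
    ptype: PacketType
    count: int = 0                  # used by RECEIVE_OPERAND packets
    address: Optional[int] = None   # used by RECEIVE_OPERAND packets
    payload: int = 0                # used by DATA packets


@dataclass
class ReceivingProcessor:
    receive_count: int = 0          # stands in for a working register
    dest_address: int = 0           # stands in for a working register
    memory: dict = field(default_factory=dict)

    def on_packet(self, pkt: MapPacket) -> None:
        if pkt.ptype is PacketType.RECEIVE_OPERAND:
            # A new sender announces how many packets will follow.
            self.receive_count += pkt.count
            if pkt.address is not None:
                self.dest_address = pkt.address
        else:
            # Data packet: store at the current destination address.
            # Advancing the address by one is an assumption of this sketch.
            self.memory[self.dest_address] = pkt.payload
            self.dest_address += 1
            self.receive_count -= 1

    def done(self) -> bool:
        # The MAP function may terminate once nothing more is expected.
        return self.receive_count == 0


proc = ReceivingProcessor()
proc.on_packet(MapPacket(PacketType.RECEIVE_OPERAND, count=2, address=100))
proc.on_packet(MapPacket(PacketType.DATA, payload=7))
proc.on_packet(MapPacket(PacketType.DATA, payload=9))
```

Because the receive count is accumulated rather than overwritten, a
later "receive operand" packet from an additional sender simply
raises the number of packets still expected.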


6.0 ANALYSIS OF TRAC IMPLEMENTATION 


This section completes the definition of the
integrated packet switching mechanism and
discusses the selections among design
alternatives.


The packet formats (Figs. 1 and 2) must both
be supported. Each is suitable for different
applications (as cited in Section 3). When
handling MAP packets the microcode must know
which format it is transmitting or receiving.
The sending task's processors have to know the
format so that they can write the packet data
appropriately. The receiving task's processors
need to know the format so that they can read the
contents properly, and also because they need to
know the destination address at which to place
this data. For the address-data packet format
(Fig. 1) this address is specified in the packet.
For the data/control packet format (Fig. 2), the
receiving task already knows this address
implicitly, or has been informed of it prior to
the multi-byte data communication. The packet
type information is indicated by 4 bits in the
first byte of each packet (Figs. 1 and 2).


Because of the way the destination addresses for
the data/control packet format (Fig. 2) are
specified, it is necessary to restrict the number
of simultaneous senders of data/control format MAP
packets. While receiving this format the receiving
task's processors hold the data destination
address in an internal working register. An
internal register needs to be used if the
microcode is to receive the packet in reasonable
time. On TRAC the number of internal registers
available is limited. Therefore when multi-byte
data communication is done using MAP packets of
data/control format, we restrict the receiver to
receive them from only one sender at a time. This
serialization of communication leads to loss of
parallelism. It does not, however, affect the
basic application. Such MAP packets were assumed
to be used by the operating system and/or by tasks
for transmitting "large" amounts of data between
tasks. They will not be used to support dataflow
or data realignment. Removal of the internal
register restriction, or providing the processor
with a greater number of internal working
registers, would allow multiple senders to a
single receiver, increasing potential parallelism.


It is possible to receive address-data
packets (Fig. 1) from multiple senders, because
the destination address for the data is specified
in the packet. The microcode in the receiving
task's processors does not have to use its
internal working registers to store this address.
Therefore there is no restriction imposed by the
microcode on the number of simultaneous senders
of address-data packets to a single receiver.
Dataflow and data realignment can be effectively
implemented with packets of this format.


It is possible that while handling
address-data senders, another sender requests to
join the transmission with data/control format
packets. It is possible for the microcode to
receive packets from one such sender along with
the other address-data packet senders. The
receiving task will not allow a second
data/control packet sender to join the
transmission until the previous such sender
terminates. On terminating, the sending task's
task head indicates completion by sending an
interrupting packet of a special format. On
receiving this interrupting packet the receiving
task enables another data/control packet sender
if any are queued. In order to allow such a mix
of packet formats to be received, the microcode
must identify the packet format for each MAP
packet it receives.


The restrictions on simultaneous packet
reception can be implemented by the receiving
task head. Whenever a sending task requests
mapping packet communication, it will specify the
packet format to be used. This is specified in
the "request" interrupting packet sent to the
receiving task head. The receiving task's task
head checks its mapping packet status data to see
if it can allow packet communication in the
requested format. The task head will not allow two
data/control packet senders to transmit at the
same time. If the task head finds that the
sender can join the MAP function, it acknowledges
that sender. If not, the task head either sends
a "transmit-deny" acknowledge to the sender
immediately, or it queues this request and sends
the acknowledge at the end of the current
data/control format MAP function execution. In
the second case the task head will also have to
inform its task's processors about this new sender
at the end of the current MAP function. In the
first case the sender will try to gain permission
at a later time. If we choose this option, there
is a possibility of congesting the interrupting
channel with request and deny acknowledge packets.
This overhead is reduced in the second case since
the request is queued and the sender waits to
receive the transmit acknowledgement.
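
The queueing option described above can be sketched as follows.
This is only an illustration of the admission rule (any number of
address-data senders, at most one data/control sender, later
data/control requests queued); the class name ReceivingTaskHead,
its methods and the format strings are hypothetical and are not
taken from the TRAC implementation.

```python
from collections import deque


class ReceivingTaskHead:
    """Admission control for MAP senders, using the queued-request option."""

    def __init__(self):
        self.active_data_control = None   # at most one such sender at a time
        self.active_address_data = set()  # any number of these may be active
        self.queued = deque()             # waiting data/control senders

    def request(self, sender, fmt):
        """Handle a "request" interrupt packet.

        Returns the acknowledgement sent immediately, or None when the
        request is queued and acknowledged later.
        """
        if fmt == "address-data":
            self.active_address_data.add(sender)
            return "transmit-enable"
        if self.active_data_control is None:
            self.active_data_control = sender
            return "transmit-enable"
        self.queued.append(sender)
        return None

    def terminate(self, sender):
        """Handle the completion interrupt packet from a sender.

        Returns the queued data/control sender that is now enabled, if any.
        """
        if sender == self.active_data_control:
            self.active_data_control = None
            if self.queued:
                self.active_data_control = self.queued.popleft()
                return self.active_data_control
        else:
            self.active_address_data.discard(sender)
        return None


head = ReceivingTaskHead()
acks = [
    head.request("A", "address-data"),
    head.request("B", "data/control"),
    head.request("C", "data/control"),   # queued behind B
    head.request("D", "address-data"),
]
enabled_after_b = head.terminate("B")
```

Under the immediate-deny option, request() would instead return a
"transmit-deny" acknowledge in place of queueing, and the sender
would have to retry.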


The next implementation option to be
reviewed is the specification of the MAP
function's operands. These are the send count
(number of packets to be sent), the destination
processor ID and the source address of the data
to be sent, for each processor in the sending
task. Each processor of the receiving task must
know the receive counts (number of packets to be
received) and the destination address of the
data. These MAP function operands must be
specified and bound at compile and load time.
Special data descriptors and data structures
(templates) are used to store them. The sending
task's operands are bound in its task's space.


This scheme somewhat increases the
complexity of the system's loader. The MAP
information is bound at load time to the sending
task's address space and therefore requires a two
pass loader. It does not require any space in
the receiving task to store the "receive
operands", and thus gives a small storage space
overhead. Further, it does not congest the
interrupting packet channel and uses only the
mapping packet channel to communicate its control
and data. Since the processors are not
interrupted explicitly to transfer "receive
operands", it does not require resynchronization
within the MAP instruction. It does, however,
require resynchronization to terminate the MAP
instruction.


7.0 CONCLUSION 


This paper has defined a multi-purpose
packet data movement capability for a network
architectured multiprocessor computer system. It
has been shown that such capabilities can be
effectively implemented in an integrated manner,
and that the packet switching functions are
compatible with and complementary to a circuit
switching functionality for the network. This
integrated packet communication system is
operational in the current four processor, nine
memory configuration of TRAC. Exploration of the
design space for implementation gave a clear
resolution of desirable choices. The
functionality and the implementation techniques
are largely independent of the choice of network
structures. The implementation concepts should
be broadly applicable to network architectured
multiprocessors.


TIMING CONTROL OF VLSI BASED NLOGN AND CROSSBAR NETWORKS*

ABSTRACT

Two basic data flow control methods for circuit
switched, pipelined networks of the general NLogN
and Crossbar (CB) topologies are modelled and
their effects on overall achievable data rates are
determined. The synchronous method uses a global
clock; as network modules grow, clock skew and
clock tree charge/discharge times grow,
resulting in lower data rates. The asynchronous
method relies on local request/acknowledge signals
to control data movement and hence its
performance is less affected by system growth.


1.0 Introduction 

Advances in VLSI technology have made
available a number of low cost yet powerful
microprocessor chips. This has led to a host of
proposals ([8], [9], [10]) for the design of
closely coupled multiple processor systems in
which a number of processors are connected
together by a communications network. The network
handles interprocessor communication and enables
resource sharing. Its design is a key factor in
determining overall system performance.

The principal issue of interest in this paper
is the type of control scheme to be used for
control of data movement in a large circuit
switched interconnection network environment where
the network is partitioned into a number of
subnetwork chips [2], and data transfer through
the network is pipelined. Two principal methods
that can be used in controlling data movement
along the network are referred to as the
synchronous (or clocked) and the asynchronous (or
self-timed) schemes [7]. The synchronous control
scheme has been traditionally favoured, especially
in small systems, because of its logic design
simplicity. The presence of global clock signals,
however, makes such systems difficult to expand
and as the system grows, system performance may
deteriorate due to the increases in clock skew.
The absence of any global signals in an
asynchronous system makes it inherently modular
and expandable and hence it becomes attractive in
systems where the size of the system cannot be
predicted in advance, where a number of subsystems
operate independently, or where system size (and
clock skew) require inordinately large clock
periods for proper operation.

The analysis in this paper follows that of
[11], focusing here, however, on the data rates
achievable in both CB and NLogN networks when
asynchronous and clocked control schemes are
utilized. The analysis provides a quantitative
approach to making the CB/NLogN,
asynchronous/synchronous design decisions.
Decision curves are provided for a particular
example to illustrate the procedure.


2.0 Protocol Issues

A complete interconnection network requires
control provisions for path establishment,
transfer of data from source to destination,
detection of a blocked path and indication of end
of transmission with path clearing. We will
assume that these requirements are satisfied as
shown in [2] and that the network has been
partitioned following a bit slice architecture
approach with one bit per plane. The present
analysis focuses on the data rates achievable
after a path has been established from a source to
a destination. Hence this analysis will hold for
systems where the average message length is much
larger than the average number of modules in a
path in the network, that is, data transfer time is
much larger than path establishment time.

For the asynchronous network, a delay
insensitive control structure is adopted. That
is, insertion of arbitrary delay between modules
will not cause the network to malfunction. Also,
transition sensitive logic is employed. Figure 1
shows the interconnection between two asynchronous
modules. A transition on R1 indicates the
presence of a "1" data bit, while a transition on
R0 indicates the presence of a "0" data bit. The
"A" line supplies the acknowledge response signal.


Interconnection of two synchronous modules is
shown in Figure 2. For the synchronous network,
the standard two-phase level sensitive clock is
used for the data transfer. Data at the input of
a module is captured at phase one of the clock and
is transferred to the output of the module at
phase two of the clock.
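
The transition signalling described above for the asynchronous
modules (a transition on R1 for a "1" bit, a transition on R0 for a
"0" bit) can be illustrated with a small sketch. The function names
and the initial line levels are assumptions made for this example,
and the acknowledge line "A" is deliberately left out.

```python
def encode_transitions(bits, r1=0, r0=0):
    """Return the (R1, R0) line levels after each bit is sent.

    A "1" bit toggles R1 and a "0" bit toggles R0, so every bit
    produces exactly one transition on exactly one request line.
    """
    levels = []
    for b in bits:
        if b == 1:
            r1 ^= 1
        else:
            r0 ^= 1
        levels.append((r1, r0))
    return levels


def decode_transitions(levels, r1=0, r0=0):
    """Recover the bit stream by observing which line changed."""
    bits = []
    for new_r1, new_r0 in levels:
        bits.append(1 if new_r1 != r1 else 0)
        r1, r0 = new_r1, new_r0
    return bits


sent = [1, 0, 0, 1]
line_levels = encode_transitions(sent)
received = decode_transitions(line_levels)
```

Since the information is carried by the change itself rather than
by a level sampled at a clock edge, the receiver needs no timing
reference, which is what allows arbitrary delay to be inserted
between modules.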


3.0 Asynchronous Banyan Delay Model

The Huffman finite state machine
representation of a logical implementation of the
asynchronous module is shown in Figure 3. If we
assume that the environment of the module can
cause a change at the module input as soon as the
environment receives a change at the module
output, then it can be shown [1] that the
sufficient conditions on the various delays to
achieve race-free operation are given by the
relations

dF >= dL (3.1)

dO >= dF + dL (3.2)

where dL is the maximum delay of the combinational
logic.

A pair of communicating modules i and j in a
path k is modelled as shown in Figure 4.
Considering module i, the maximum propagation
delay from any input to any output of the
combinational logic is dLi, the propagation delay
of the feedback path is dFi and the propagation
delay from module i to module j is dPij.
Similarly for module j, we have dLj and dFj and
dPji, the propagation delay in the acknowledge
path from module j to module i. The delay from
the output of the combinational logic of module i
through the combinational logic of module j to the
input of the combinational logic of module i,
corresponding to the term dO of Figure 3, is given
by

dO = max(dFi,dPij) + dLj + max(dFi,dPji) (3.3)

If we assume that condition (3.1) is satisfied
(dFi >= dLi and dFj >= dLj), then dO given by
(3.3) satisfies condition (3.2).

The minimum delay in transferring two
successive pieces of data (e.g. successive words
which are part of the same message) between the
pair of Asynchronous BA modules i and j is equal
to the maximum loop delay as given below.

dABAij = dLi + max(dFi,dPij) + dLj + max(dFi,dPji) (3.4)

Consider next all of the pairs of communicating
modules along a particular path k in the network.
Since data transfer is pipelined, we next
determine the maximum delay between module pairs
on that path.

The path k is modelled as shown in Figure 5
where each pair of communicating modules and the
maximum loop delay associated with that pair is
shown. Since the network is pipelined, the
minimum time between transfer of two successive
data items along the path k, dABAk, is given by
the maximum of the delays dABA12, dABA23, ...,
dABA(n-1)n:

dABAk = 2dL + 2(max(dPk,dF)) (3.5)

where dL and dF are the maximum values associated
with the combinational logic and feedback delays;
dPk is the maximum delay between modules for path
k, that is

dPk = max(dPij,dPji) for all communicating
modules i and j in path k (3.6)

Notice that dPk will be dependent on the
particular path under consideration since this
delay reflects the layout of module chips on a
printed circuit board and that layout is in turn
dependent on the topology of the network being
considered. The average of dABAk over all paths
(a total of M are present) in the network gives
the average delay between successive data
transfers. Assuming all paths in the network are
equally used this average can be expressed as

dABA = ( \sum_{k=1}^{M} dABAk ) / M (3.7)

We will assume here that the maximum delays
associated with the combinational logic and
feedback are equal for all paths in the network.
Then if dPk >= dF for all k, equation (3.7)
becomes:

dABA = 2dL + 2( \sum_{k=1}^{M} dPk ) / M (3.8)

Letting dPBA be the average of dPk over all paths
we obtain

dABA = 2dL + 2dPBA (3.9)

where dPBA = ( \sum_{k=1}^{M} dPk ) / M
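
Equations (3.4), (3.5) and (3.7) can be exercised numerically with
the short sketch below. The function names are chosen here for
illustration, and the three path delays of 36, 40 and 44 nsec are
hypothetical values, not measurements from the paper; dL = dF = 34
nsec is the asynchronous module figure quoted in Section 5.0.

```python
def loop_delay(dLi, dFi, dPij, dLj, dPji):
    """Loop delay between modules i and j, equation (3.4) as printed."""
    return dLi + max(dFi, dPij) + dLj + max(dFi, dPji)


def path_delay(dL, dF, dPk):
    """Minimum time between successive data items on path k, equation (3.5)."""
    return 2 * dL + 2 * max(dPk, dF)


def average_delay(dL, dF, dPk_values):
    """Average of dABAk over all paths, equation (3.7)."""
    return sum(path_delay(dL, dF, p) for p in dPk_values) / len(dPk_values)


dL = dF = 34                 # nsec, asynchronous module (Section 5.0)
dPk_values = [36, 40, 44]    # nsec, hypothetical per-path maxima

dABA = average_delay(dL, dF, dPk_values)

# With dPk >= dF on every path, (3.9) gives the same number directly.
dPBA = sum(dPk_values) / len(dPk_values)
dABA_closed_form = 2 * dL + 2 * dPBA
```

The agreement between the two results holds only while every dPk is
at least dF; if some path delay falls below the feedback delay, the
max() in (3.5) takes over and (3.9) underestimates the average.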
3.1 Synchronous Banyan Delay Model

A pair of synchronous modules in a path from a
source to a destination is modelled as a finite
state machine in Figure 6. A two phase clocking
scheme is used to clock the memory elements 1 and
2 of each module. Considering module i, the
maximum combinational logic delay is dLi, the
memory delays are dMi1 and dMi2, the
interconnection path delay from module i to module
j is dPij, and the clock delays are dCi1 and dCi2.
Similarly for module j.

The following three constraints on the clock
period (obtained as in [11]) must hold to ensure
proper operation:

T >= dMi1 + dMi2 + dPij + dLj + (dCi1 - dCj1) (3.10)

T >= dMi2 + dPij + dLj + dMj1 + (dCi2 - dCj2) (3.11)

T >= dMi1 + dMi2 + dLi (3.12)

In most designs the bound given by the third
constraint is smaller than either of the first two
and it will not be considered further. The
quantities (dCi1 - dCj1) and (dCi2 - dCj2) are the
differences between the times the phases 1 and 2 of
the clock arrive at the corresponding memory
elements of the two modules and are referred to as
the clock skew, defined as

deltaC1 = dCi1 - dCj1 (3.13)

deltaC2 = dCi2 - dCj2 (3.14)

If dM, dPBAmax, dL and delta represent maximum
values which can occur over any data path, then
the constraints of (3.10) and (3.11) can be
written as

T >= dL + 2dM + dPBAmax + delta (3.15)

where dPBAmax is the maximum path delay between
any pair of communicating modules over the entire
network, that is,

dPBAmax = max(dPij,dPji) for all
communicating modules (3.16)

Another constraint on the clock period relates
to the clock tree charge/discharge time. For
reliable operation of the system the clock period
must be greater than the time required to charge
and discharge the clock tree to voltage levels
which can be reliably sensed by the gates in the
network. Let this time be represented by tau and
thus T >= tau. The worst case condition clock
period for the Synchronous BA network, dSBA, is
now given by

dSBA = max(dL + 2dM + dPBAmax + delta, tau) (3.17)
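
Equation (3.17) selects whichever of the two constraints is
binding. The sketch below makes that explicit; the function name
and the path delay, skew and tau values in the example calls are
hypothetical, while dL = 45 nsec and dM = 4 nsec are the
synchronous module figures quoted in Section 5.0.

```python
def sync_clock_period(dL, dM, dP_max, delta, tau):
    """Worst case clock period, equation (3.17).

    Returns the period and a label saying which term is binding:
    the data path term of (3.15) or the clock tree time tau.
    """
    data_path_term = dL + 2 * dM + dP_max + delta
    if data_path_term >= tau:
        return data_path_term, "data path and skew"
    return tau, "clock tree charge/discharge"


# Hypothetical small module: tau is short, the data path term binds.
small = sync_clock_period(dL=45, dM=4, dP_max=40, delta=5, tau=60)

# Hypothetical large module: tau has grown past the data path term.
large = sync_clock_period(dL=45, dM=4, dP_max=40, delta=5, tau=120)
```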

3.2 Asynchronous Crossbar Delay Model

The delay model for the CB network is obtained
in a similar manner as for the BA network. The
maximum Asynchronous CB loop delay for modules i
and j is then obtained as

dACBij = dLi + max(dPij,dFi) + dLj + max(dPji,dFi) (3.18)

Since data is pipelined as in the BA network, we
obtain the average delay as dACB given by

dACB = 2dL + 2*max(dPCB,dF) (3.19)

where dPCB is the path delay between two
communicating modules. It should be noted that
because of the planar construction of the CB
network, the distance between two interchip
communicating modules is constant, independent of
network size. The maximum path delay between two
communicating modules is a constant not dependent
on any particular path being considered. This is
a key difference in the analysis of the two
networks. If dPCB >= dF then equation (3.19) can
be written as

dACB = 2dL + 2dPCB (3.20)


3.3 Synchronous Crossbar Delay Model

The synchronous delay model for the CB network
is similar to the model developed for the BA
network (Figure 6). The delay in the network is
given by dSCB where

dSCB = max(dL + 2dM + dPCB + delta, tau) (3.21)

Note that in the CB network, dPCB is also the
maximum delay between two communicating modules.

4.0 Delay Parameters

Consider next the various delay parameters
needed for evaluation of dABA, dSBA, dACB and dSCB
in Equations (3.9), (3.17), (3.20) and (3.21)
respectively. To estimate these values the
synchronous and asynchronous modules were designed
using NMOS technology with a minimum feature size
of 2.5 microns. From these designs values for dL
and dF were obtained (see section 5.0).

Considering the path delays dPBA, dPBAmax and
dPCB, the delay of on-chip paths is negligible
compared to the delay of off-chip paths. The
off-chip delay can be minimized by using
exponential drivers and is given by [5]:

dP = d*e*ln(CL/Cg) (4.1)

where Cg is the capacitance of an elemental gate
and CL is the load capacitance. The capacitance
CL consists of two pin capacitances (Cpin) and the
external path capacitance. For the CB case, the
maximum path length between communicating modules
in any path is L2 (Figure 7), and (4.1) becomes:

dPCB = d*e*ln((2Cpin + Cb*L2)/Cg) (4.2)

where Cb is the capacitance per unit length of the
printed circuit board. For the BA case dPBA and
dPBAmax are derived in [12] as:

dPBA = d*e*ln((2Cpin + Cb*L1
       + ((N**2+1)*(N-1)/(2N**4))*N'*Cb*L2)/Cg) (4.3)

dPBAmax = d*e*ln((2Cpin + Cb*L1
          + (N-1)*N'*Cb*L2/N**4)/Cg) (4.4)

where L1 and L2 are the spacing between chips used
in the BA and CB networks as shown in Figure 7, N
is the network size in a chip module and N' the
overall network size. This more complex
expression reflects the changing path lengths
between banyan network stages.

We next consider the clock skew. For this
analysis it is assumed that the clock as presented
to the individual chip modules has no skew and
that all skew occurs within the chip. The clock
skew can be attributed to:

(a) Differences in line lengths.

(b) Differences in the passive line parameters
like resistance, dielectric constant that
determine the line time constant.

(c) Differences in the threshold voltages of
the two modules.

One possible clock distribution scheme that
guarantees equal length paths, thus eliminating
(a) from consideration, is shown in Figures 8 and
9. As shown in Figures 8 and 9, the section AB of
the clock tree is common to all the modules in the
chip and hence does not contribute towards the
clock skew. Let the maximum and minimum time
constants of the clock tree from B to all the leaf
nodes be RCmax and RCmin. Then, given equal
length paths the clock skew can be found as [11]:

delta = RCmin*ln(1-(VTmin/Vdd))
        - RCmax*ln(1-(VTmax/Vdd)) (4.5)

where VTmax and VTmin are the maximum and minimum
values associated with the threshold voltages of
the gates of the network and Vdd is the power
supply voltage. RCmax and RCmin can be obtained
as functions of the clock tree time constant RC
(e.g. RCmax=k1*RC, RCmin=k2*RC).

Determination of the clock tree time constant
is a problem that has been solved in [4] and [6].
Using the development of [4] for the tree starting
at B (Figures 8, 9) we get RC as:

RC = 9*(1-1/N)*(N**3 - 1)*ROCO/7 (4.6)

where RO and CO are the resistance and capacitance
of the last (and smallest) section of the clock
tree. The time constant (RCf) for the entire
clock tree (starting at A) is:

RCf = (3 - 2/N)*(10N**3 - 3)*ROCO/7 (4.7)
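
The skew expression (4.5) can be checked numerically once RCmax and
RCmin are written as multiples of RC. The sketch below is only an
illustration: the function name and its keyword arguments are
chosen here, and the default factors 1.2 and 0.8 together with the
voltages in the example call are the values quoted in Section 5.0.
The result is expressed as a multiple of RC so that it does not
depend on a particular value of the time constant.

```python
import math


def clock_skew(rc, vt_min, vt_max, vdd, k_max=1.2, k_min=0.8):
    """Clock skew from equation (4.5), with RCmax = k_max*RC, RCmin = k_min*RC."""
    rc_max = k_max * rc
    rc_min = k_min * rc
    return (rc_min * math.log(1 - vt_min / vdd)
            - rc_max * math.log(1 - vt_max / vdd))


# Skew per unit of RC for VTmin = 2.0 V, VTmax = 3.0 V, Vdd = 5 V.
skew_per_rc = clock_skew(rc=1.0, vt_min=2.0, vt_max=3.0, vdd=5.0)

# With no spread in time constant or threshold the skew vanishes.
no_variation = clock_skew(rc=1.0, vt_min=2.5, vt_max=2.5, vdd=5.0,
                          k_max=1.0, k_min=1.0)
```

For these values the skew comes to roughly 0.69 RC, so the skew
grows in proportion to the clock tree time constant of (4.6).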

The total time to charge and discharge the clock
tree, tau, has been derived in [12] and is given
by

tau = RCf*( ln((Vdd-VTmin+Vn)/(Vdd-VTmax-Vn))
            + ln((VTmax+Vn)/(VTmin-Vn)) ) (4.8)

where Vn is the noise margin required for reliable
circuit operation. Notice that for this
simplified analysis RCf is a function of the
module size N and not the overall network size N'.
That is, only the clock tree charge/discharge time
within the chip module is considered.

5.0 Example

As an example let us consider an N'*N' network
built from N*N size module chips which are laid on
copper printed circuit boards. The pin
capacitance for this type of construction is about
4pF and the capacitance of an elemental gate is
0.01pF. The various delays are:

d = 2 nsec; dM = 4 nsec
dL = dF = 45 nsec. for synchronous module
dL = dF = 34 nsec. for asynchronous module

We will assume that the size of the largest chip
available is 1cm*1cm on which a single bit slice
32*32 network can be implemented. Assume that
only one layer of metal is available. Let a
fraction q of the clock line be distributed in
diffusion and the rest in metal. If Rd is the
resistance per square of diffusion, Cd and Cm the
capacitance per unit area in pF/sq.micron of
diffusion and metal respectively, then the time
constant of the last section of the clock tree
ROCO is given by [12]:

ROCO = ((10000/32)**2)*Rd*q*(2*q*Cd
       + 3*(1-q)*Cm)/8000 nsec (5.1)

The fabrication constants of the current NMOS
technology have the following values:

Rd = 20 ohms/sq., Cm = 10**-4 pF/sq. micron,
Cd = 0.3*10**-4 pF/sq. micron


Also, the variation of time constant and threshold
voltage during fabrication is about 20% (i.e.
RCmax=1.2*RC, RCmin=0.8*RC). We take the supply
voltage Vdd=5 V, the typical threshold voltage as
2.5 V and the noise margin as 0.5 V. Then
VTmax=3.0 V, VTmin=2.0 V and Vn=0.5 V. For
illustration purposes, take q=5%, L1=1 inch, L2=2
inches and Cb=1 pF/inch (note values of L1 and L2
are dependent on board technology factors such as
whether multilayer, wirewrap or other technology
is used). The delays dABA and dACB for the
asynchronous modules and the delays dSBA and dSCB
for the synchronous modules of the BA and CB
networks are plotted in Figures 10 and 11 against
N', the network size. In Figure 12 the same
delays are plotted against N, the module size.
From Figure 12 we can make a comparison of the
delays associated with the asynchronous and
synchronous control schemes. It is clear that in
the case of the Banyan network, for small N the
synchronous control scheme results in a smaller
delay. Consider a particular network size N'.
From the intersection of the curves for the
asynchronous and synchronous control schemes for
this value of N', we can obtain the range of
values of N for which a particular control scheme
is better. For example, for N'=512 we get the
following result:

BANYAN:    Synchronous control better for N < 20
           Asynchronous control better for N > 20

Similar conclusions can be reached from the CB
network curves. Notice that the CB
network delays are independent of the network size
because the intermodule distances are constant,
independent of the module or network size. Thus
we get the following result:

CROSSBAR:  Synchronous control better for N < 24
           Asynchronous control better for N > 24

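
As a worked check on the example parameters, the asynchronous
Crossbar delay can be computed directly from (4.2) and (3.19). The
sketch below uses only values quoted in this section (d = 2 nsec,
Cpin = 4 pF, Cg = 0.01 pF, Cb = 1 pF/inch, L2 = 2 inches, and
dL = dF = 34 nsec for the asynchronous module); the function and
variable names are chosen here for illustration.

```python
import math


def off_chip_delay(d, c_load, c_gate):
    """Off-chip path delay with exponential drivers, equation (4.1)."""
    return d * math.e * math.log(c_load / c_gate)


d = 2.0        # nsec
c_pin = 4.0    # pF
c_gate = 0.01  # pF
c_board = 1.0  # pF per inch
L2 = 2.0       # inches
dL = dF = 34.0 # nsec, asynchronous module

# Equation (4.2): load is two pin capacitances plus the board trace.
dPCB = off_chip_delay(d, 2 * c_pin + c_board * L2, c_gate)

# Equation (3.19): asynchronous Crossbar delay.
dACB = 2 * dL + 2 * max(dPCB, dF)

print(f"dPCB = {dPCB:.2f} nsec, dACB = {dACB:.2f} nsec")
```

Here the load is 10 pF, so CL/Cg = 1000 and dPCB comes to about
37.6 nsec. This exceeds dF, so the condition for (3.20) holds and
dACB is about 143 nsec, independent of both N and N'.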
6.0 Conclusions 


The Banyan and the Crossbar networks have been
modelled according to the type of control scheme
used. These models were used to obtain the delay
associated with each type of network and control
scheme. The delay equations were then used to
obtain the delay curves of Figures 10 and 11 and
Figure 12.

Comparison of the delays in the Banyan and the
Crossbar networks was made for both types of
control schemes, the synchronous and the
asynchronous. It can be observed from Figure 10
that the BA network delays increase with N', the
network size, because of the increase in the
physical inter-chip path lengths. The
asynchronous network delay decreases with the
module size because the implementation of a
network of a given size requires fewer module
chips and hence the physical path lengths between
communicating modules decrease. Physical path
lengths decrease with N for the synchronous case
also; however, here large modules result in larger
clock skew which dominates the inter-chip delay.
The synchronous network delay is the maximum of
two terms. For small module sizes the first term
of (3.17) (which includes inter-chip delay and
clock skew) dominates, while for large module sizes
(e.g. N=32) the clock tree charge/discharge time,
tau, dominates. Tau increases rather rapidly
(O(N**3)) with the module size and this results in
the steep rise in Figure 12 for large N. The
behaviour of the delays of the Crossbar network
was similar except that the planar construction
of the network resulted in smaller delays. Notice
also that only a single curve is obtained for the
asynchronous case because of the modular and
planar structure of the Crossbar network. This
was also the reason for the delays in the Crossbar
network for both types of control schemes being
less than those in the Banyan network. Delay
curves were obtained for a particular example
clearly showing the delay tradeoffs for the Banyan
and Crossbar networks, and the synchronous and the
asynchronous control schemes.

REFERENCES

[1] Fang, T.P., "On the Design of Hazard Free
Circuits", Comp. Sys. Lab., Tech. Mem. 285,
Washington University, St. Louis, MO (Nov. 81).

[2] Franklin, M.A., Wann, D.F. and Thomas, W.J.,
"Pin Limitations and Partitioning of VLSI
Interconnection Networks", IEEE Trans. on
Comp., Vol. C-31, No. 11, Nov. 1982.

[3] Goke, L.R. and Lipovski, G.J., "Banyan Networks
for Partitioning Multiprocessor Systems",
Proc. 1st Annu. Symp. Comput. Arch., 1973.

[4] Kung, S.Y. and Gal-Ezer, R.J., "Synchronous vs
Asynchronous Computation in VLSI Array
Processor", Proc. SPIE, Vol. 341, May 1982.

[5] Mead, C. and Conway, L., INTRO. TO VLSI
SYSTEMS, Addison-Wesley Pub. Co., Reading, MA
(1980).

[6] Penfield, P. and Rubinstein, J., "Signal
Delay in RC Tree Networks", Proc. 18th Design
Auto. Conf., June 1981.

[7] Seitz, C.L., "Self-timed VLSI Systems", Proc.
Caltech Conf. VLSI, Jan. 1979.

[8] Sejnowski, M.C., et al., "An overview of the
Texas Reconfigurable Computer", AFIPS Proc.,
Nat. Comp. Conf. (1980).

[9] Sullivan, H. and Bashkow, T.R., "A Large Scale
Homogeneous, Fully Distributed Parallel
Machine I", Proc. 4th Ann. Symp. on Comp.
Arch. (March 1977).

[10] Swan, R.J., et al., "Cm* A Modular Multi-
Microprocessor", AFIPS Proc. Nat. Comp.
Conf. (1977).

[11] Wann, D.F. and Franklin, M.A., "Asynchronous
and Clocked Control Structures for VLSI Based
Interconnection Networks", IEEE Trans. on
Comput., March 1983.

[12] Dhar, S., Franklin, M.A. and Wann, D.F.,
"Timing Control of VLSI based NLogN and
Crossbar Networks", Center for Computer
Systems Design, Washington Univ., St. Louis,
MO, Tech. Rpt. CCSD83-101 (May 1983).

Figure 1: Interconnection of two asynchronous modules.
Figure 2: Interconnection of two synchronous modules.
Figure 3: Huffman asynchronous circuit model
Figure 4: Delay model for two adjacent asynchronous modules.
Figure 7: 8*8 Crossbar and Banyan networks
built from 2*2 module chips.
Figure 9: Clock distribution for an 8*8 Crossbar network.
Figure 10: Delay variation of Banyan network as a function of
network size N' with module size N as a parameter.
Figure 11: Delay variation of Crossbar network as a function of
network size N' with module size N as a parameter.
Figure 12: Delay variation of Crossbar and Banyan networks
as a function of module size N with network
size N' as a parameter.
Abstract -- This paper focuses on the testing of an important
class of interconnection networks called (N,K) shuffle/exchange
networks. A sequential circuit model is used for the basic switching
element. A general fault model for the switching element is
introduced. A testing strategy is presented which involves the
exhaustive testing of each switching element without exhaustively
testing the entire network. Each switching element is exhaustively
tested via the application of a checking sequence. It is shown that
the class of (N,K) shuffle/exchange networks is C-testable. A
network is C-testable if it can be fully tested using a constant
number of test patterns. A test sequence of constant length is
constructed which, when applied to a (N,K) shuffle/exchange
network, will fully test the entire network.


1. Introduction 
Previous works 


In recent years, many multistage interconnection networks have
been proposed and extensively studied. Most research efforts
focus on the network topologies, routing algorithms, and potential
applications [7]. More recently, issues involving reliability, fault
tolerance and fault diagnosis are being addressed [1]. There has
been limited investigation of the testing and testability of such
networks, which constitute the focus of this paper.


Most of the previous works assumed each basic switching
element to be a combinational circuit, each requiring a separate
control line. Most of the fault models assumed are quite restrictive.
Frequently only faults involving a single line stuck at a logic value or
a single switching element stuck at a switch state are considered. A
more general fault model is needed.


(N,K) shuffle/exchange networks 


A β-element is a 2 x 2 switching element that can be set to one of
two states, namely the "Through" (0) state or the "Cross" (1) state,
corresponding to the two possible permutations of its two input
terminals; see Figure 1a. A β-element can be implemented as a
two-state sequential circuit; see Figure 1b. Each β-element in a
network can be independently set to either the 0 or the 1 state. To
facilitate self-routing and to reduce the number of I/O pins needed,
Levitt et al. [6] propose a β-element which uses the two data inputs
a and b for transmitting data as well as destination address tag bits
used for routing. A third input, c, determines whether the input
terminal lines contain data or routing tag bits. A similar routing
scheme is assumed in this paper. Details of it can be found in [5].


An N x N shuffle/exchange stage has N input terminals and N
output terminals and consists of a perfect shuffle connection [9]
followed by (N/2) β-elements. For convenience, N is assumed to
be a power of 2. Let $m = \log_2 N$; then each terminal can be
identified by a binary number of m bits. Starting from the top, each
β-element can be identified by a binary number of m-1 bits. All the
β-elements in the same stage examine the routing tag bits
synchronously. A single control signal c can be used for all the
β-elements in the same stage. This scheme reduces the number of
control signals needed from (N/2) down to one per stage.

(a) The beta element

(b) The state table of the beta element

Fig. 1. The β-element and its state table.
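The addressing just described can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and pairing adjacent shuffled positions top-down onto β-elements (even position to input a, odd position to input b) is an assumption about the wiring, not something the text spells out at this point.

```python
def perfect_shuffle(i: int, m: int) -> int:
    """Position reached by terminal i: rotate its m-bit address left by one."""
    msb = (i >> (m - 1)) & 1
    return ((i << 1) & ((1 << m) - 1)) | msb


def element_and_port(i: int, m: int):
    """Element index (an (m-1)-bit number, counted from the top) and input
    port ('a' or 'b') fed by terminal i.

    Assumption: shuffled positions 2e and 2e+1 feed inputs a and b of
    element e.
    """
    pos = perfect_shuffle(i, m)
    return pos >> 1, "ab"[pos & 1]


if __name__ == "__main__":
    m = 3  # N = 8 terminals, N/2 = 4 elements per stage
    for i in range(1 << m):
        print(format(i, "03b"), "->", element_and_port(i, m))
```

Under this assumed wiring, element e receives input a from the terminal whose address is 0 followed by the bits of e, and input b from the terminal whose address is 1 followed by the bits of e.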


Fig. 2. The (8,4) shuffle/exchange network. 


A (N,K) shuffle/exchange network has N input terminals, N
output terminals, and consists of a cascade of K identical N x N
shuffle/exchange stages. The stages can be numbered from left to
right as 1,2,...,K. The outputs from stage i are connected to the
inputs of stage i+1 as shown in Figure 2. $(\beta_{m-2}\beta_{m-3}\cdots\beta_0)_j$ denotes
the β-element $(\beta_{m-2}\beta_{m-3}\cdots\beta_0)$ in the jth stage.


It is assumed that the (N,K) shuffle/exchange network uses a
routing scheme involving destination address tags [6]. Before
inputting data at an input terminal, a K-bit routing address tag
$(d_1 d_2 \cdots d_K)$, one bit for each stage, is used to set the switches so as to
provide the desired connection path. The β-element in the ith
stage examines $d_i$ and sets its state according to the value of $d_i$.


Many well known networks are (N,K) shuffle/exchange networks
[7]. When $K = \log_2 N$, the (N,K) shuffle/exchange network is the
omega network, and is topologically equivalent to a class of
well-known networks. These networks include the modified data
manipulator, the flip network used in STARAN, the indirect binary
n-cube network and the regular SW banyan network with spread
and fanout of 2. When K = 1, it is the well known shuffle/exchange
network proposed by Stone [9].


In Section 2, the fault model and the testing strategy are formally
introduced. In Section 3, the concept of C-testability is introduced
and applied to the testing of (N,K) shuffle/exchange networks. In
Section 4, it is shown that the class of (N,K) shuffle/exchange
networks is C-testable. A test sequence is constructed whose
length is independent of the network size.


2. Testing Methodology


Beta-element fault model


The state table of a sequential machine completely characterizes
the machine's behavior. Since a β-element is modelled as a 2 state
sequential machine, any β-element failure which causes an
arbitrary change to the original state table is considered a fault.
Our fault model assumes:


1. A fault in a β-element is any arbitrary deviation from the
fault-free state table of the β-element without
increasing the number of states.


2. There is at most one faulty β-element.


3. The fault is permanent.


Fig. 3. Input line a stuck at 1.


Fig. 4. Output line y stuck at 0.


Fig. 5. β-element stuck at the 1 state.


(Each figure is a state table giving next state/outputs <xy> for each
input vector <abc> and current state; * marks faulty signals.)


As shown in Figures 3, 4 and 5, a conventional fault involving
either a line stuck at some logical value or a β-element stuck at the
0/1 state can be represented using this fault model. Many other
fault types can be modelled. This fault model is quite compatible
with VLSI implementations. On a VLSI chip, faults tend to be
arbitrary but confined to a certain area of the chip.


Exhaustive testing of beta elements 


Since the fault model allows a β-element to fail in an arbitrary
way, each β-element must be exhaustively tested. A β-element,
modelled as a 2 state sequential machine, can be exhaustively tested
using the checking experiment approach. Hennie described a
method for sequential machine identification using a checking
experiment [3]. A checking experiment for a machine involves the
construction of a checking sequence of the machine. A checking
sequence consists of an input sequence and the corresponding
output sequence which can uniquely characterize a sequential
machine. The sequential machine must have a strongly connected
state diagram and a distinguishing sequence in order for a
checking sequence to exist. A distinguishing sequence is an input
sequence the application of which allows the current state of the
machine to be determined from the output sequence. A checking
sequence must perform the following three functions: (1) Initialize
the machine into a known state S. (2) Verify the number of states in
the machine. (3) Starting from state S, for every entry in the state
table, an input vector is applied to stimulate that entry and then a
distinguishing sequence is applied to verify the state transition.


If at any time during the checking experiment, the actual machine
responds in a manner other than that dictated by the expected
output sequence, the sequential machine must be faulty. If a
β-element behaves correctly throughout the checking experiment,
and assuming the number of states has not been increased by a
fault, then the state table of this machine must be the same as the
fault free one. By carefully designing the input checking sequence,
the length of the sequence can be reduced [8]. Figure 6 shows the
state transition diagram and some useful input sequences of the
β-element. The diagram is strongly connected. It has two
distinguishing sequences of length one, namely the two input
vectors <abc> = <010> and <100>. Furthermore it has two
synchronizing sequences, <001> and <111>, also of length one. (A
synchronizing sequence is an input sequence which when applied
to a machine results in a unique final state independent of the initial
state.)

(b) Distinguishing sequences: <010>,<100>
(c) Synchronizing sequences: <001>,<111>
Fig. 6. State transition diagram and some
useful input sequences of the β-element.
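The properties quoted for the β-element can be checked on a small executable model. The state table itself is not reproduced in the text, so the table used here is an assumption pieced together from statements elsewhere in the paper: the outputs follow the current state (state 0 passes a to x and b to y, state 1 swaps them), the state is held when c = 0, and a control pulse c = 1 loads input a as the next state. The names `beta_step`, `is_distinguishing` and `is_synchronizing` are hypothetical.

```python
from itertools import product


def beta_step(state, a, b, c):
    """One step of the assumed beta-element machine.

    Returns (next_state, (x, y)). Assumed table: outputs depend on the
    current state; c = 1 loads a as the next state, c = 0 holds the state.
    """
    x, y = (a, b) if state == 0 else (b, a)
    next_state = a if c == 1 else state
    return next_state, (x, y)


def is_distinguishing(vec):
    """A length-one input distinguishes the states if their outputs differ."""
    return beta_step(0, *vec)[1] != beta_step(1, *vec)[1]


def is_synchronizing(vec):
    """A length-one input synchronizes if both states reach the same state."""
    return beta_step(0, *vec)[0] == beta_step(1, *vec)[0]


def state_table():
    """All 16 entries [s, <abc>] of the assumed table."""
    return {
        (s, vec): beta_step(s, *vec)
        for s in (0, 1)
        for vec in product((0, 1), repeat=3)
    }


if __name__ == "__main__":
    for (s, vec), (ns, out) in sorted(state_table().items()):
        print(f"[{s},<{''.join(map(str, vec))}>] -> next {ns}, outputs {out}")
```

In this sketch <010> and <100> come out as distinguishing and <001> and <111> as synchronizing, in agreement with Figure 6; the model is only as reliable as the assumed table.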

The checking experiment for the β-element involves the
application of a synchronizing sequence (<001> or <111>) followed
by the activation of a state transition and then followed by a
distinguishing sequence (<010> or <100>) for each of the 16 entries
or state transitions in the state table. A (N,K) shuffle/exchange
network has (N/2)K β-elements. If the entire network is considered
as a single sequential machine, it will have $2^{(N/2)K}$ states. It is
infeasible to design a checking sequence for such a state machine
even for relatively small N and K. Consequently, the appropriate
testing strategy is to exhaustively test each β-element without
exhaustively testing the entire network. The main task now is to
construct the smallest possible sequence of network input test
patterns which will result in the efficient simultaneous application
and observation of the checking sequences to all the β-elements.
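To give a feel for the size of the combined state space, the following sketch evaluates the element and state counts for two small networks. The helper name `network_size` is hypothetical; the formulas are the ones stated in the paragraph above.

```python
def network_size(N: int, K: int):
    """Number of beta-elements and number of network states when the whole
    (N,K) network is treated as one sequential machine."""
    elements = (N // 2) * K
    return elements, 2 ** elements


if __name__ == "__main__":
    for N, K in [(8, 4), (16, 5)]:
        elements, states = network_size(N, K)
        print(f"({N},{K}): {elements} elements, {states} states")
```

Even the (8,4) network of Figure 2 already has 16 elements and 65536 network states, while each individual element has only 2 states and 16 table entries.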


3. C-testability and Test Vectors 


C-testability 


The problem of testing iterative arrays was first studied by Kautz 
[4], who assumed that an individual cell can be tested for all its 
possible faults only by applying all possible input vectors to that 
cell. The necessary and sufficient conditions were given by Kautz 
for testing an iterative array with a single faulty cell. Friedman [2] 
studied a class of one-dimensional unilateral combinational 
iterative arrays which requires a constant number of tests to detect 
all faults, independent of the size of the array. He called them 
C-testable iterative arrays. He also assumed that there is only one 
faulty cell in the array. In this paper, the concept of C-testability is 
generalized and applied to two-dimensional (N,K) shuffle/exchange 
networks. Unlike the combinational arrays studied by Kautz and 
Friedman, a (N,K) shuffle/exchange network consists of cells which 
are sequential circuits. Applying Kautz’s necessary and sufficient 
conditions to (N,K) shuffle/exchange networks we have the 
following: 


Definition 1: A (N,K) shuffle/exchange network is testable if the
following conditions are met:


1. For each β-element, all entries in the state table can be
stimulated, i.e. all the state transitions can be activated,
and then verified.


2. For each β-element, any faulty signal produced by a
faulty β-element can be propagated to an observable
network output.


Condition 1 is necessary and sufficient for the exhaustive testing
of every β-element. Condition 2 ensures the detection of any faulty
signal.


Definition 2: A (N,K) shuffle/exchange network is C-testable if it
is testable and the number of network test vectors, or test patterns,
required is a constant and independent of the size of the network.
Four useful test vectors


A test vector for the (N,K) shuffle/exchange network consists of
two sub-vectors. The first sub-vector consists of the data inputs to
the β-elements in the first stage, and the second sub-vector consists of the
K control signals, $c_1, c_2, \ldots, c_K$. All K control signals are
normally inactive except when new communication paths are being
established by routing tag bits, during which time the same active
signal is applied to all K control lines sequentially; hence all K lines
can be considered as one logical control line. $T^c = \langle t_0 t_1 \cdots t_{N-1} \rangle^c$ is
used to denote the test vector, where $t_i$ is the input to the ith input
terminal of the (N,K) shuffle/exchange network and c is the control
signal input. A shift register can be used to shift a c pulse to
successive stages in synchronism with the arrival of destination
address tag bits at the data inputs of successive stages. $\bar{T}^c$ is used
to denote $\langle \bar{t}_0 \bar{t}_1 \cdots \bar{t}_{N-1} \rangle^c$, where $\bar{t}_i$ = 0 (or 1) if $t_i$ = 1 (or 0).


Definition 3: The ith terminal of any stage with $i = (p_{m-1} p_{m-2} \cdots p_0)$
is even-weighted if $\sum_{j=0}^{m-1} p_j$ is an even number, and is
odd-weighted if $\sum_{j=0}^{m-1} p_j$ is an odd number.


Four useful network test vectors are now introduced. Let $(p_{m-1} p_{m-2} \cdots p_0)$
be the binary representation for i, denoting the ith input
terminal of the network. We define the four test vectors as follows:


1. $T_0^c = \langle 00 \cdots 0 \rangle^c$, i.e. $t_i = 0$ for $0 \le i \le N-1$.

2. $T_1^c = \langle 11 \cdots 1 \rangle^c$, i.e. $t_i = 1$ for $0 \le i \le N-1$.

3. $T_2^c = \langle 01 \cdots 1 \rangle^c$ where

   a. $t_i = 0$ for all even-weighted $t_i$.

   b. $t_i = 1$ for all odd-weighted $t_i$.

4. $T_3^c = \langle 10 \cdots 0 \rangle^c$ where

   a. $t_i = 1$ for all even-weighted $t_i$.

   b. $t_i = 0$ for all odd-weighted $t_i$.


Note that $T_0^c = \bar{T}_1^c$ and $T_2^c = \bar{T}_3^c$. The four test vectors for a
shuffle/exchange network with N = 8 are illustrated below:
Under the fault free condition, the test vector $T_0^c$ will apply <00c>
to each β-element in the (N,K) shuffle/exchange network and the
test vector $T_1^c$ will apply <11c> to each β-element in the (N,K)
shuffle/exchange network. When the β-elements in a
shuffle/exchange stage are all in state 0, then the input terminal
$(p_{m-1} p_{m-2} \cdots p_0)$ to this stage is connected to the output terminal
$(p_{m-2} p_{m-3} \cdots p_0 p_{m-1})$ of the same stage [8]. When the β-elements in
a shuffle/exchange stage are all in state 1, then the input terminal
$(p_{m-1} p_{m-2} \cdots p_0)$ to this stage is connected to the output terminal
$(p_{m-2} p_{m-3} \cdots p_0 \bar{p}_{m-1})$ of the same stage [8].


Lemma 1 Under the fault free condition, the test vector $T_0^1$ will
apply <001> to each β-element and set all β-elements in a (N,K)
shuffle/exchange network to the 0 state; the test vector $T_0^0$ will
apply <000> to each β-element regardless of the current states of
the β-elements.


Proof: By definition, the test vector $T_0^1$ will apply <001> to every
β-element in the first stage. From the state table of a β-element, the
outputs should be <00> and the next state will be state 0. Since the
outputs from the first stage are all 0's, the inputs to the following
stage are all 0's too. Hence every β-element in the subsequent
stages will receive the input vector <001>. $T_0^0$ will apply <000> to
each β-element in the first stage. The same foregoing argument
applies except that the next state is the same as the current state. ∎

Lemma 2 Under the fault free condition, the test vector $T_1^1$ will
apply <111> to each β-element and set all β-elements in a (N,K)
shuffle/exchange network to the 1 state; the test vector $T_1^0$ will
apply <110> to each β-element regardless of the current states of the
β-elements.


Proof: Similar to the proof of Lemma 1.


Theorem 1 Under the fault free condition, if every β-element in a
(N,K) shuffle/exchange network is in state 0, then the test vector
$T_2^0$ (or $T_3^0$) will apply the same input vector $\langle t_0 t_1 \cdots t_{N-1} \rangle^0$ to all K
stages. Every β-element remains in state 0 after the test vector is
applied to the network.


Proof: We know that the ith stage of β-elements simply connects
the input terminal $(p_{m-1} p_{m-2} \cdots p_0)$ of stage i to the output terminal
$(p_{m-2} p_{m-3} \cdots p_0 p_{m-1})$ of stage i, which is also the input terminal
$(p_{m-2} p_{m-3} \cdots p_0 p_{m-1})$ of stage i+1, for all $1 \le i < K$. Since the ith
stage only permutes the bits of the terminal address $(p_{m-1} p_{m-2} \cdots p_0)$,
which leaves its weight unchanged, the input values $t_j$,
$0 \le j \le N-1$, of the (i+1)th stage are the same as those of the ith
stage. So if the test vector for the first stage is $\langle t_0 t_1 \cdots t_{N-1} \rangle^0$ then
every subsequent stage will receive the same vector $\langle t_0 t_1 \cdots t_{N-1} \rangle^0$
under the fault free condition. Since the control signal c is 0, the
β-elements do not change states. ∎

Figure 7 illustrates Theorem 1. The current state, x, and the next
state, y, of each β-element are denoted as x/y.


Fig. 7. Illustration of Theorem 1.


Theorem 2 Under the fault free condition, if every β-element in a
(N,K) shuffle/exchange network is in state 1, then the test vector
$T_2^0$ (or $T_3^0$) will apply the input vector $\langle t_0 t_1 \cdots t_{N-1} \rangle^0$ to the input
terminals of the first stage of β-elements. For all the subsequent
stages, $T_2^0$ (or $T_3^0$) will apply the same vector $\langle t_0 t_1 \cdots t_{N-1} \rangle^0$ to the
input terminals of the odd stages, and the complemented vector
$\langle \bar{t}_0 \bar{t}_1 \cdots \bar{t}_{N-1} \rangle^0$ to the input terminals of the even stages. All the
β-elements remain in state 1.


Proof: The proof is similar to the proof of Theorem 1. ∎
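The two theorems can be exercised on a small simulator. The sketch below is not taken from the paper: it assumes the β-element table inferred from the text (outputs follow the current state, c = 1 loads input a as the next state), assumes that shuffled positions 2e and 2e+1 feed inputs a and b of element e, and uses hypothetical names throughout. It pushes a weight-parity vector with c = 0 through an (8,4) network and records what each stage sees at its input terminals.

```python
def beta_step(state, a, b, c):
    """Assumed beta-element table: state 0 passes (a, b), state 1 swaps them;
    c = 1 loads a as the next state, c = 0 holds the state."""
    x, y = (a, b) if state == 0 else (b, a)
    return (a if c else state), x, y


def rotl(i, m):
    """Perfect shuffle: rotate the m-bit terminal address left by one."""
    return ((i << 1) & ((1 << m) - 1)) | (i >> (m - 1))


def apply_vector(states, vec, c, m):
    """Push one test vector through every stage.

    states[j][e] is the state of element e in stage j (0-based).
    Returns (vector seen at each stage's input terminals, next states).
    """
    seen, nxt, cur = [], [], list(vec)
    for stage in states:
        seen.append(cur)
        shuffled = [0] * len(cur)
        for i, v in enumerate(cur):
            shuffled[rotl(i, m)] = v
        out, new_stage = [], []
        for e, s in enumerate(stage):
            ns, x, y = beta_step(s, shuffled[2 * e], shuffled[2 * e + 1], c)
            new_stage.append(ns)
            out += [x, y]
        nxt.append(new_stage)
        cur = out
    return seen, nxt


def parity_vector(N):
    """0 on even-weighted terminals, 1 on odd-weighted terminals."""
    return [bin(i).count("1") % 2 for i in range(N)]


if __name__ == "__main__":
    m, K = 3, 4
    vec = parity_vector(1 << m)
    for level in (0, 1):
        states = [[level] * (1 << (m - 1)) for _ in range(K)]
        seen, _ = apply_vector(states, vec, 0, m)
        print("all elements in state", level)
        for j, v in enumerate(seen, start=1):
            print("  stage", j, "".join(map(str, v)))
```

With every element in state 0 all four stages see the same vector; with every element in state 1 the odd-numbered stages see the vector and the even-numbered stages see its complement, which is the behaviour stated in Theorems 1 and 2.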


4. C-Testable (N,K) Shuffle-Exchange Networks 


Test Sequence I: Stimulation


The stimulation of each entry in the state table of every
β-element is considered first. If the test sequence $\mathcal{T}_0 = \{T_0^0\ T_1^0\ T_2^0\ T_3^0\}$
is applied to a (N,K) shuffle/exchange network, with all its
β-elements in state 0, the entries with current state = 0 and input
vectors <000>, <010>, <100> and <110> will be stimulated. For
convenience, the symbol [s,<xxx>] is used to denote the entry in the
state table with current state s and input vector <xxx>. Thus the
above stated entries are denoted as [0,<000>], [0,<010>], [0,<100>]
and [0,<110>].


If all the β-elements are in state 1, by Theorem 2, the test
sequence $\mathcal{T}_1 = \{T_0^0\ T_1^0\ T_2^0\ T_3^0\}$ will stimulate the following
entries: [1,<000>], [1,<010>], [1,<100>] and [1,<110>]. The
stimulation of those entries in the state table involving state
changes are now considered. Using similar arguments as in the
proofs for Theorems 1 and 2, we can derive the following results.




Theorem 3 Under the fault free condition with all the β-elements
in state 0, the test vector $T_2^1$ will apply the same input vector to the
input terminals of every stage of the network. All the β-elements
$(\beta_{m-2}\beta_{m-3}\cdots\beta_0)$ with $\sum_{j=0}^{m-2}\beta_j$ = even will receive the input vector
<011> and remain in state 0, and the β-elements with $\sum_{j=0}^{m-2}\beta_j$ = odd
will receive the input vector <101> and then change from state 0 to
state 1. (See Figure 8.)


Fig. 8. Illustration of Theorem 3 using $T_2^1$.


Corollary 1 Under the same circumstance as in Theorem 3, the
test vector $T_3^1$ will apply the same input vector to the input
terminals of every stage of a (N,K) shuffle/exchange network.
Every β-element with $\sum_{j=0}^{m-2}\beta_j$ = odd will receive the input vector
<011> and remain in state 0, while every β-element with $\sum_{j=0}^{m-2}\beta_j$ =
even will receive the input vector <101> and then change from state
0 to state 1.


Theorem 4 Under the fault free condition with all the β-elements
in state 1, the test vector $T_2^1$ will apply the input vector
$\langle t_0 t_1 \cdots t_{N-1} \rangle^1$ to all odd stages and the input vector
$\langle \bar{t}_0 \bar{t}_1 \cdots \bar{t}_{N-1} \rangle^1$ to all
even stages. For a β-element in the jth stage, if $j + \sum_{k=0}^{m-2}\beta_k$ = odd,
the β-element will receive the input vector <011> and change from
state 1 to state 0, and if $j + \sum_{k=0}^{m-2}\beta_k$ = even, the β-element will
receive the input vector <101> and remain in state 1. (See Figure 9.)

Fig. 9. Illustration of Theorem 4 using $T_2^1$.


Corollary 2 Under the same circumstance as in Theorem 4, the
test vector $T_3^1$ will apply the input vector $\langle \bar{t}_0 \bar{t}_1 \cdots \bar{t}_{N-1} \rangle^1$ to all odd
stages and the input vector $\langle t_0 t_1 \cdots t_{N-1} \rangle^1$ to all even stages. For a
β-element in the jth stage, if $j + \sum_{k=0}^{m-2}\beta_k$ = odd, the β-element will
receive the input vector <101> and remain in state 1, and if
$j + \sum_{k=0}^{m-2}\beta_k$ = even, the β-element will receive the input vector <011>
and change from state 1 to state 0.


From Lemma 1, Theorem 3 and Corollary 1, if all the β-elements
in the network are in state 0 and the entire network is fault free,
applying the test sequence $\mathcal{S}_0 = \{T_2^1\ T_0^1\ T_3^1\ T_0^1\}$ will stimulate
the four entries, [0,<011>], [0,<001>], [0,<101>], and [1,<001>], in all
the β-elements. After the test sequence, all the β-elements of the
network will be back in state 0 as they were before the
application of this test sequence.


Another test sequence $\mathcal{S}_1 = \{T_2^1\ T_1^1\ T_3^1\ T_1^1\}$ is useful when
all the β-elements are in state 1. Again assuming the entire network
is fault free, then from Lemma 2, Theorem 4 and Corollary 2, this
test sequence will stimulate the four entries, [1,<101>], [1,<111>],
[1,<011>], and [0,<111>], in all the β-elements. In the process of
applying $\mathcal{S}_1$, all the β-elements start in state 1 and return to state 1.
As can be seen, using the foregoing four test sequences, all sixteen
entries in the state table of each β-element can be stimulated.
Figure 10 illustrates the entries stimulated by the four test
sequences.
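The claim that all sixteen entries are reached in every element can be checked by simulation. The sketch below rests on several assumptions that are not confirmed by the text: the β-element table is the one inferred from the theorems (outputs follow the current state, c = 1 loads input a), the wiring pairs shuffled positions 2e and 2e+1 onto element e, and the order of the vectors inside each sequence is one plausible choice that visits each mixed network state once. All function names are hypothetical.

```python
def beta_step(state, a, b, c):
    """Assumed beta-element table: state 0 passes (a, b), state 1 swaps them;
    c = 1 loads a as the next state, c = 0 holds the state."""
    x, y = (a, b) if state == 0 else (b, a)
    return (a if c else state), x, y


def rotl(i, m):
    """Perfect shuffle: rotate the m-bit terminal address left by one."""
    return ((i << 1) & ((1 << m) - 1)) | (i >> (m - 1))


def apply_vector(states, vec, c, m, seen):
    """Apply one vector, update element states in place and record every
    stimulated entry (state, a, b, c) per element in `seen`."""
    cur = list(vec)
    for j, stage in enumerate(states):
        shuffled = [0] * len(cur)
        for i, v in enumerate(cur):
            shuffled[rotl(i, m)] = v
        out = []
        for e, s in enumerate(stage):
            a, b = shuffled[2 * e], shuffled[2 * e + 1]
            seen[j][e].add((s, a, b, c))
            stage[e], x, y = beta_step(s, a, b, c)
            out += [x, y]
        cur = out


def stimulation_sequence(N):
    """Assumed 18-vector order: initialize to 0, four c = 0 vectors, four
    c = 1 vectors, then the same pattern after initializing to 1."""
    par = [bin(i).count("1") % 2 for i in range(N)]
    T = {0: [0] * N, 1: [1] * N, 2: par, 3: [1 - p for p in par]}
    hold = [(k, 0) for k in range(4)]
    order = ([(0, 1)] + hold + [(2, 1), (0, 1), (3, 1), (0, 1)]
             + [(1, 1)] + hold + [(2, 1), (1, 1), (3, 1), (1, 1)])
    return [(T[k], c) for k, c in order]


def run(m, K):
    """Run the whole sequence on an (N,K) network that starts in state 0."""
    N = 1 << m
    states = [[0] * (N // 2) for _ in range(K)]
    seen = [[set() for _ in range(N // 2)] for _ in range(K)]
    for vec, c in stimulation_sequence(N):
        apply_vector(states, vec, c, m, seen)
    return states, seen


if __name__ == "__main__":
    states, seen = run(3, 4)
    print("entries stimulated per element:",
          sorted({len(s) for stage in seen for s in stage}))
```

Since an element has 2 states and 8 input vectors, a recorded set of size 16 means every table entry of that element was stimulated. The run uses 18 vectors regardless of N and K, which is the point of the construction.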
Fig. 10. Entries stimulated by the four test sequences.


Test Sequence II: Verification 


Under the fault free condition, as the above test sequences are
applied, the entire network at any point in time can be in only one of
six possible states. Denote the jth β-element of the ith stage by
$(\beta_{m-2}\beta_{m-3}\cdots\beta_0)_i$, where $(\beta_{m-2}\beta_{m-3}\cdots\beta_0) = j$. These six states of the network
are:


$S_0$: all the β-elements are in state 0


$S_1$: all the β-elements are in state 1


$S_2$: 1. β-elements with $\sum_{k=0}^{m-2}\beta_k$ = even are in state 0

       2. β-elements with $\sum_{k=0}^{m-2}\beta_k$ = odd are in state 1


$S_3$: 1. β-elements with $\sum_{k=0}^{m-2}\beta_k$ = even are in state 1

       2. β-elements with $\sum_{k=0}^{m-2}\beta_k$ = odd are in state 0

$S_4$: 1. every $(j)_i$ is in state 0 if $i + \sum_{k=0}^{m-2}\beta_k$ = odd

       2. every $(j)_i$ is in state 1 if $i + \sum_{k=0}^{m-2}\beta_k$ = even

$S_5$: 1. every $(j)_i$ is in state 1 if $i + \sum_{k=0}^{m-2}\beta_k$ = odd

       2. every $(j)_i$ is in state 0 if $i + \sum_{k=0}^{m-2}\beta_k$ = even


For each of the six possible network states, a set of network test
vector(s) is needed to verify the current states of all the β-elements.


Definition 4: A set of input vector(s) to a (N,K)
shuffle/exchange network is a distinguishing set of test vectors, or
simply distinguishing set, if when they are applied to the network, a
distinguishing sequence, i.e. the input vector <010> or <100>, is
applied to each β-element in the network.


Theorem 5 $\{T_2^0\}$ is a distinguishing set if the network is in state
$S_0$ or state $S_1$. $\{T_3^0\}$ is also a distinguishing set for state $S_0$ or
state $S_1$.

Proof: The β-element $(\beta_{m-2}\beta_{m-3}\cdots\beta_0)_i$ receives its input a
from the input terminal $(0\,\beta_{m-2}\cdots\beta_0)$ and input b from the input
terminal $(1\,\beta_{m-2}\cdots\beta_0)$. When $T_2^0$ is applied to a network, a
β-element will receive a <010> if $\sum_{k=0}^{m-2}\beta_k$ = even and a <100> if
$\sum_{k=0}^{m-2}\beta_k$ = odd. Similarly, when $T_3^0$ is applied, a β-element will receive
a <100> if $\sum_{k=0}^{m-2}\beta_k$ = even and a <010> if $\sum_{k=0}^{m-2}\beta_k$ = odd. From the
discussion above, it is easy to see that either $T_2^0$ or $T_3^0$ will apply a
distinguishing sequence to each β-element in the network if all the
β-elements are in the same state. ∎
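A small check of this property is sketched below, again under assumptions that go beyond the text: the β-element outputs follow the current state, shuffled positions 2e and 2e+1 feed inputs a and b of element e, and the function names are hypothetical. The sketch collects the input vector that reaches every element when a weight-parity vector is applied with c = 0, and tests whether each element receives <010> or <100>.

```python
def rotl(i, m):
    """Perfect shuffle: rotate the m-bit terminal address left by one."""
    return ((i << 1) & ((1 << m) - 1)) | (i >> (m - 1))


def element_inputs(states, vec, m):
    """Input pair (a, b) reaching each element, stage by stage, for c = 0.

    Assumed element behaviour: state 0 passes (a, b), state 1 swaps them;
    with c = 0 no element changes state.
    """
    per_stage, cur = [], list(vec)
    for stage in states:
        shuffled = [0] * len(cur)
        for i, v in enumerate(cur):
            shuffled[rotl(i, m)] = v
        pairs = [(shuffled[2 * e], shuffled[2 * e + 1]) for e in range(len(stage))]
        per_stage.append(pairs)
        cur = []
        for s, (a, b) in zip(stage, pairs):
            cur += [a, b] if s == 0 else [b, a]
    return per_stage


def is_distinguishing_set(states, vec, m):
    """True if every element receives a != b, i.e. <010> or <100> with c = 0."""
    return all(a != b for stage in element_inputs(states, vec, m) for a, b in stage)


def parity(i):
    return bin(i).count("1") % 2


if __name__ == "__main__":
    m, K = 3, 4
    N = 1 << m
    vec = [parity(i) for i in range(N)]
    uniform0 = [[0] * (N // 2) for _ in range(K)]
    uniform1 = [[1] * (N // 2) for _ in range(K)]
    mixed = [[parity(e) for e in range(N // 2)] for _ in range(K)]
    for name, st in (("all 0", uniform0), ("all 1", uniform1), ("mixed", mixed)):
        print(name, is_distinguishing_set(st, vec, m))
```

In this model the weight-parity vector works when all elements share a state but fails in the mixed state where even-weighted elements are in state 0 and odd-weighted elements are in state 1, because the second stage then receives equal values on a and b. That is consistent with the need for different vectors in the mixed states.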


To verify that all the β-elements are in state 0, i.e. the network is
in $S_0$, we need to apply the distinguishing set $T_2^0$ (or $T_3^0$) to the
network after the application of each of the four test vectors in $\mathcal{T}_0$.
Similarly, $T_2^0$ (or $T_3^0$) is needed after each test vector in $\mathcal{T}_1$ to
verify that all the β-elements remain in state 1. Note that each of
these distinguishing sets consists of only one input vector.


Two particular test vectors, to be used for constructing
distinguishing sets, are now defined. A test vector called
$V_{odd} = \langle v_0 v_1 \cdots v_{N-1} \rangle^0$ is defined as follows: $v_i = 0$ for all terminals
$i = (p_{m-1} p_{m-2} \cdots p_0)$ such that $\sum_{j\ \mathrm{odd}} p_j$ = even, and $v_i = 1$ for all
terminals i such that $\sum_{j\ \mathrm{odd}} p_j$ = odd. Similarly, another test
vector called $V_{even}$ is defined as follows: $v_i = 0$ for all i such that
$\sum_{j\ \mathrm{even}} p_j$ = even, and $v_i = 1$ for all i such that
$\sum_{j\ \mathrm{even}} p_j$ = odd.

Theorem 6 $\{V_{odd}\}$ constitutes a distinguishing set for a (N,K)
shuffle/exchange network, if the network is in one of the states $S_2$,
$S_3$, $S_4$ and $S_5$, and $\log_2 N$ is odd.


The proof of this Theorem is rather lengthy and is omitted here
but is fully documented in [5].


Complete Test Sequence 


Before the construction of a complete test sequence, we must
show that any erroneous signal produced by a faulty β-element is
always propagated to and visible at a network output. Any faulty
β-element in a (N,K) shuffle/exchange network will generate a
faulty signal D, denoting a logic 0 becoming a faulty logic 1, or $\bar{D}$,
denoting a logic 1 becoming a faulty logic 0, in response to the
network test sequence. If a D or $\bar{D}$ signal from a previous stage
arrives at an input terminal of a fault free β-element, the fault free
β-element will always propagate the D or $\bar{D}$ to one of its output
terminals. This is illustrated in Figure 11 for the case of the
propagation of D. Since our fault model assumes that there is only
one faulty β-element in a network, the faulty signal D or $\bar{D}$, once
generated, will always be propagated to an output terminal of the
Kth stage.


Fig. 11. Faulty signal propagation by a β-element.


We can construct a complete test sequence for a (N,K)
shuffle/exchange network with no limitations on N and K by using
the test sequences discussed above. This complete test sequence,
under the fault free condition, will stimulate and verify every entry in
the state table of every β-element. First, use $T_0^1$ to set all the
β-elements into state 0. The test sequence
$\mathcal{T}_0 \cup \mathcal{S}_0 = \{T_0^0\ T_1^0\ T_2^0\ T_3^0\ T_2^1\ T_0^1\ T_3^1\ T_0^1\}$
is then applied. After the application of this
sequence all the β-elements should be in state 0. Now $T_1^1$ is
applied to set all the β-elements into state 1. The test sequence
$\mathcal{T}_1 \cup \mathcal{S}_1 = \{T_0^0\ T_1^0\ T_2^0\ T_3^0\ T_2^1\ T_1^1\ T_3^1\ T_1^1\}$
is then applied. The
concatenation of the above test sequences produces a test sequence
of length 18 consisting of the following test patterns:
$T_0^1 \cup \mathcal{T}_0 \cup \mathcal{S}_0 \cup T_1^1 \cup \mathcal{T}_1 \cup \mathcal{S}_1 = \{T_0^1\ T_0^0\ T_1^0\ T_2^0\ T_3^0\ T_2^1\ T_0^1\ T_3^1\ T_0^1\ T_1^1\ T_0^0\ T_1^0\ T_2^0\ T_3^0\ T_2^1\ T_1^1\ T_3^1\ T_1^1\}$.
The application of this set of 18
test patterns will fully exercise all 16 transitions in the state table of
every β-element.


The application of each of the above 18 test patterns must be
followed by a distinguishing set to verify that all the β-elements are
indeed in the correct states. We assume that $m = \log_2 N$ is odd.
Based on Theorem 5, if the network is in either state $S_0$ or $S_1$, the
distinguishing set $\{T_2^0\}$ or $\{T_3^0\}$ can be used. If the network is in
one of the remaining four states, $S_2$, $S_3$, $S_4$ or $S_5$, based on
Theorem 6, $\{V_{odd}\}$ can be used as the distinguishing set. Hence,
the complete test sequence must include the original 18 test
patterns for stimulating all the state table entries of each β-element
and another 18 test patterns, i.e. distinguishing sets, for the
verification of network states. Hence the length of the complete
test sequence, assuming $m = \log_2 N$ is odd, is 36 and is
independent of the network size. The foregoing discussions lead to
the following result.


Theorem 7 A (N,K) shuffle/exchange network, with $\log_2 N$ =
an odd integer, is C-testable and requires a test sequence of 36 test
patterns.


If $\log_2 N$ is an even integer, it is believed that the following
conjectures are true:


Conjecture 1 $\{V_{odd}, V_{even}\}$ constitutes a distinguishing set for a
(N,K) shuffle/exchange network, if the network is in one of the
following states, $S_2$, $S_3$, $S_4$ and $S_5$, and $\log_2 N$ is even.


Each of the four states $S_2$, $S_3$, $S_4$ and $S_5$ occurs only once
during the application of the original 18 test patterns for stimulation.
Hence the distinguishing set $\{V_{odd}, V_{even}\}$, consisting of two test
vectors, needs to be applied only four times. Consequently, if $\log_2 N$
is even, the total number of test patterns needed for verification is
22 instead of 18. Hence,


Conjecture 2 A (N,K) shuffle/exchange network, with $\log_2 N$ =
an even integer, is C-testable and requires a test sequence of 40
test vectors.


The state table of a β-element has 16 entries. In order to perform
the exhaustive testing of a β-element, it is necessary and sufficient


to stimulate all 16 entries and verify all the state transitions. A
minimum of two test vectors are needed for every entry in the state
table, thus a minimum of 32 test vectors are needed to exhaustively
test a β-element. Furthermore, the initialization of the β-element
requires another two test vectors. Hence the minimum number of
test vectors required for testing an entire (N,K) shuffle/exchange
network is at least 34. The test sets obtained in this section,
consisting of 36 or 40 test vectors depending on whether $\log_2 N$ is
odd or even, are believed to be the actual minimum.
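The counting behind these figures can be laid out explicitly. The sketch below simply redoes the arithmetic from the text; the function and constant names are hypothetical. It assumes, as stated above, that the 18 stimulation patterns leave the network in a mixed state four times, and that a mixed state needs one verification vector when log2 N is odd and two when it is even.

```python
STIMULATION_PATTERNS = 18   # initialization vectors plus the four sequences
MIXED_STATE_VISITS = 4      # each mixed network state occurs once
TABLE_ENTRIES = 16          # 2 states x 8 input vectors


def verification_patterns(log2_n_is_odd: bool) -> int:
    """Distinguishing vectors needed after the 18 stimulation patterns."""
    per_mixed = 1 if log2_n_is_odd else 2
    uniform_visits = STIMULATION_PATTERNS - MIXED_STATE_VISITS
    return uniform_visits + MIXED_STATE_VISITS * per_mixed


def total_patterns(log2_n_is_odd: bool) -> int:
    return STIMULATION_PATTERNS + verification_patterns(log2_n_is_odd)


def lower_bound() -> int:
    """Two vectors per table entry plus two for initialization."""
    return 2 * TABLE_ENTRIES + 2


if __name__ == "__main__":
    print("log2 N odd :", total_patterns(True))
    print("log2 N even:", total_patterns(False))
    print("lower bound:", lower_bound())
```

The totals come out as 36 and 40 against a lower bound of 34, so the constructed sequences are within two and six vectors of the bound respectively.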


Conjecture 3 The class of (N,K) shuffle/exchange networks is
C-testable and the minimum number of test patterns required is 36
if $\log_2 N$ is odd, and is 40 if $\log_2 N$ is even.


A C-testable (16,5) shuffle/exchange network has been
designed, using three micron NMOS technology, and a layout has
been generated. Details of this design are documented in [5].
ABSTRACT


As a solution to the fault tolerance problem of the 
shuffle-exchange type networks, a class of networks is 
proposed which provide non-unique paths between inputs 
and outputs. The topology of the multiple paths is 
specified by means of a redundancy graph and the pro- 
cedure to construct a multipath network with a specified 
redundancy graph is presented. We provide practical 
schemes for utilization of the alternate paths and evalu- 
ate how well they perform in the presence of faults in 
the network. 


1. INTRODUCTION 


Several topologically equivalent multistage interconnection
networks have been proposed in the literature for
applications in closely coupled multiple processor systems
[3]. Such networks possess the property that
between any input and any output there is a unique
path made up of switching nodes. Breakdown of any
such node or an edge thus makes some outputs inaccessible
to certain inputs.


As a solution to this fault tolerance problem of the 
shuffle-exchange type networks, we introduce in this pa- 
per several classes of networks called Multipath Omega 
Networks. Such networks provide multiple ways of
getting from an input to an output, and their close relation
to the Omega topology [5] helps maintain all the
connection and control properties of the latter in a no-fault
situation. Multipath networks behave as gracefully
degrading systems, operating at a reduced level of
performance in the presence of faults, but nevertheless
providing full connectivity.


Several papers in the recent past have considered the 
idea of using more than one path to get from a source to 
a destination in multistage networks. Some networks in- 
herently possess this property ([2], [9]), while others are 
obtained by augmenting an existing network to provide 
the multiple paths [1]. The extra stage approach in [1] 
will be seen to be a special case of the class of networks 
we discuss in this paper. 


We present, in the next section, the theoretical 
development of the multiple path networks. We first in- 
troduce a convenient means of specifying the topology of 
the redundant paths, and show how a Multipath Omega 
Network of any size with a specified redundancy graph 
can be constructed. We then provide some schemes for
implementing the networks and utilizing the alternate
paths in practice, along with evaluations of their
performance in the presence of faults.


2. MULTIPLE PATH NETWORKS 


A $B^m \times B^m$ Omega Network [5] consists of $m$ stages
of $B \times B$ crosspoint switches with $B * B^{m-1}$ shuffles
interconnecting the stages. A $P * Q$ shuffle is a permutation of
$PQ$ elements $i$, $0 \le i \le PQ-1$, whose effect when $i$ is
represented as a $(p+q)$-bit binary number (with $P = 2^p$ and
$Q = 2^q$) is to rotate left the bits by $p$ positions. A
16×16 Omega Network constructed out of 4×4 switching
elements is shown in Fig. 1. In its general form, an Omega
Network of $N$ inputs and $N$ outputs, where $N$ is an arbitrary
integer, is constructed out of a set of switches the sizes of
which correspond to a complete set of factors of $N$.
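
The shuffle defined above can be sketched in a few lines of Python. This is only an illustration under the assumption that P and Q are powers of two; the function name `shuffle` and its argument names are ours, not the paper's.

```python
def shuffle(i: int, p: int, q: int) -> int:
    """P*Q shuffle of element i, with P = 2**p and Q = 2**q.

    The element index is treated as a (p+q)-bit number and its bits
    are rotated left by p positions.
    """
    n = p + q
    if not 0 <= i < (1 << n):
        raise ValueError("element index out of range")
    mask = (1 << n) - 1
    # Bits shifted out on the left re-enter on the right.
    return ((i << p) | (i >> q)) & mask


# The 2*8 shuffle on 16 elements is the familiar perfect shuffle.
perfect = [shuffle(i, 1, 3) for i in range(16)]
```

Applying the 4*4 shuffle twice to a 4-bit index rotates it by four positions, so it returns every element to its starting place.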


In an Omega Network, there is exactly one path
between any source $S$ and any destination $D$. Such a
path can be characterized by concatenating the
$n$ ($= \log_2 N$) destination address bits to the $n$ source
address bits [5]:


$$s_0 s_1 \cdots s_{n-1} d_0 d_1 \cdots d_{n-1} \qquad (1)$$


In addition, the terminal (of a switch) that a path
occupies at the output of stage $i$ ($0 \le i \le m$) is given by the
$n$-bit window in (1) starting at bit position $bi$, where
$b = \log_2 B$:


$$s_0 s_1 \cdots [\, s_{bi} s_{bi+1} \cdots s_{n-1} d_0 d_1 \cdots d_{bi-1} \,] d_{bi} \cdots d_{n-1} \qquad (2)$$
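
As a small illustration of (1) and (2), the sketch below slides an n-bit window over the concatenated source and destination addresses. It assumes addresses are given as bit strings of equal length; the helper name `window` is hypothetical.

```python
def window(source: str, dest: str, stage: int, b: int) -> str:
    """Terminal occupied at the output of `stage`, per the window in (2).

    `source` and `dest` are n-bit strings; `b` is log2 of the switch size.
    Stage 0 gives the source address, the last stage the destination.
    """
    n = len(source)
    if len(dest) != n:
        raise ValueError("source and destination must have the same width")
    start = b * stage
    if start > n:
        raise ValueError("stage lies beyond the last stage of the network")
    return (source + dest)[start:start + n]


# 16x16 network of 4x4 switches (b = 2, two stages).
terminals = [window("0110", "1011", i, 2) for i in range(3)]
```

Here `terminals` lists the source terminal, the terminal after the first stage, and the destination terminal in turn.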


The existence of such a unique path between each 
source and destination is what leads to many of the
useful properties of the Omega Network [5] and its
distributed control algorithm. However, the uniqueness of
paths implies that a fault anywhere in the network will
destroy its connectivity. While reinforcing the links and
the logic within the switches will mask some failures as
and when they occur [6], a more fundamental form of
tolerance is provided by eliminating this uniqueness property
and providing more than one way to get from a source 
to a destination. In this case, if one path is faulty, it 
may be possible to find an alternate route to the destina- 
tion. We now characterize the construction and proper- 
ties of Modified Omega Networks with multiple paths 
between sources and destinations. 


An ordered factorization of $N$ corresponds to an
$f$-tuple $\langle B_1, B_2, \ldots, B_f \rangle$ of factors satisfying
$B_1 B_2 \cdots B_f = N$. The (1-path) Omega Network
corresponding to such a factorization [5] consists of $f$ stages
of crosspoint switches, with stage $i$ made up of
$B_i \times B_i$ switches. In addition, each stage is preceded by
the $B_i * \frac{N}{B_i}$ shuffle interconnection.


Define a pseudofactorization of $N$ to be an $f$-tuple
$\langle B_1, B_2, \ldots, B_f \rangle$ of integers, with
$B_1 B_2 \cdots B_f = B$, that satisfy the following conditions:
B>N and B/B; <N 1<j<f. 


Let $B/N$ be equal to $R$. Then an R-path Omega Network
corresponding to the above pseudofactorization
consists of $f$ stages of crosspoint switches with stage $i$
consisting of $B_i \times B_i$ switches; each stage $i$ is preceded
by $k_i$ $B_i * \frac{N}{B_i}$ shuffles ($k_i \ge 1$) such that there
are exactly $R$ ways of going from each source to each
destination.


We will refer to $R$ as the redundancy of the
multipath Omega Network. Figs. 2 and 4 show two 4-path
16×16 Omega Networks (corresponding to the
pseudofactorizations <2,4,2,4> and <2,2,4,4> of 16). The
two networks are different in the manner in which the
four different paths from a source to a destination
interact. The topology of the redundant paths is specified
by means of a redundancy graph indicated below each 
network. 


2.1 Redundancy Graphs 
A redundancy graph is a flow graph with the follow- 
ing restrictions: 


(1) The set of nodes in the graph is divided into S 
classes corresponding to the S stages of switches in the 
network. 


(2) Each edge connects a node in class $i$ to a node in
class $i+1$, $1 \le i \le S-1$.


(3) The in-degrees of all nodes in a class are the same 
and so are the out-degrees of all nodes in a class. 


Three examples of redundancy graphs are shown in 
Fig. 3. Fig. 3a corresponds to a disjoint path network (all 
redundant paths are disjoint) [7], while Fig. 3c is the 
redundancy graph of a standard (1-path) Omega Net- 
work. The nodes in the graph correspond to switches in 
the network and the edges to links. In that sense the 
redundancy graph is a subgraph of the network and the 
subgraph connecting any input to any output in the
network will be isomorphic to the redundancy graph. The
number of faults a network can tolerate is given by $\lambda - 1$,
where $\lambda$ is the line connectivity of its redundancy graph.


The control scheme for setting up the paths in such a
network is the distributed tag control scheme, much like
the one for the standard Omega Network. Each stage $i$
is controlled by $b_i = \log_2 B_i$ bits so that the entire
destination tag consists of $\sum b_i$ bits. The difference, of
course, is that the destination tag in an R-path network
consists of $n+r$ bits where $r = \log_2 R$. Only $n$ of these
bits are the destination address bits $d_0 d_1 \ldots d_{n-1}$, and a
particular path out of the $R$ alternates is chosen by a
specific setting of the $r$ redundant bits. We go into this in
more detail in section 2.2. The broad scheme for using the
multiple paths is to backtrack, in the event of a fault, up to
the point of the last fork and then take an alternate
route. This backtracking can also be done in case of
blocking along one of the paths. Referring to Fig. 3, it
can be seen that the graphs for a given $S$ and $R$ differ
basically in the following three aspects:


(1) The stage(s) at which fork/join is done

(2) The magnitude of the fork/join done at each stage

(3) The number of disjoint paths between source and
destination

The effect of these variables on the performance and cost
of the system is considered in [8].
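
The bookkeeping behind the destination tag can be checked with a short sketch. It assumes every switch size and N are powers of two; the function name `tag_layout` is hypothetical and not part of the paper.

```python
from math import prod


def tag_layout(pseudofactors, n_terminals):
    """Return (R, n, r, bits per stage) for a pseudofactorization of N.

    R = B/N is the redundancy, n = log2 N the number of address bits,
    r = log2 R the number of redundant bits, and stage i is controlled
    by b_i = log2 B_i tag bits. Powers of two are assumed throughout.
    """
    b_total = prod(pseudofactors)
    if b_total % n_terminals:
        raise ValueError("N must divide the product of the switch sizes")
    redundancy = b_total // n_terminals
    n = n_terminals.bit_length() - 1
    r = redundancy.bit_length() - 1
    bits_per_stage = [size.bit_length() - 1 for size in pseudofactors]
    return redundancy, n, r, bits_per_stage


# The 4-path 16x16 network built from <2,4,2,4>.
R, n, r, stage_bits = tag_layout([2, 4, 2, 4], 16)
```

For this pseudofactorization the per-stage bits sum to n + r, matching the tag length stated above.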


2.2 Derivation of NW from Redundancy Graph 


A redundancy graph specifies both the number of
stages $S$ and the redundancy $R$. The total number of
bits needed to control the network, irrespective of how it
is constructed, is $n+r$. The only variables are the sizes
of the switches used in different stages and the
distribution of the redundant bits among the $S$ stages. Note
that at any stage the smallest switch that can realize an
out-degree (or in-degree) $D$ is a $D \times D$ switch.


The terminal an input-output path occupies at the
output of stage $i$ is given by the $n$-bit window defined
earlier in (2). Consider such a window $W_i$ at stage $i$:


$$W_i = \underbrace{w_0 w_1 \cdots w_{n-b-1}}_{\text{switch}} \; \underbrace{w_{n-b} \cdots w_{n-1}}_{\text{terminal}}$$


For a fork size of $D$ at this stage, exactly $d = \log_2 D$ of
the redundant bits $r_0 r_1 \ldots r_{r-1}$ should be a part of the
subwindow $w_{n-b} \ldots w_{n-1}$; this will ensure that from the
same switch $D$ paths will fork out. Similarly, if at this
stage there are $k$ disjoint paths in the graph, i.e., there
are $k$ nodes at this stage in the redundancy graph, then
$\log_2 k$ of the redundant bits should be a part of the
subwindow $w_0 w_1 \ldots w_{n-b-1}$. The above two conditions will
yield the subgraph shown below at stage $i$:


[Figure: subgraph between stages i and i+1]


Consider joins now. A join of size $D$ at stage $i$
reduces the number of disjoint paths by a factor of $D$,
i.e., $d$ of the redundant bits in window $W_{i-1}$ are replaced
by destination bits in $W_i$. The $b_i$ tag bits at stage $i$
always replace the least significant $b_i$ bits in $W_{i-1}$; hence
the removal of the $d$ redundant bits is achieved by
choosing an appropriate shuffle to precede stage $i$. This
may necessitate choosing shuffles other than the
$B_i * \frac{N}{B_i}$ shuffle.


Parallel edges correspond to a join immediately
following a fork; in this case, the redundant bits introduced
in the terminal subwindow $w_{n-b} \ldots w_{n-1}$ in stage $i-1$
should be replaced by d-bits at stage $i$. For a join in the
absence of parallel edges, r-bits in the switch subwindow
$w_0 w_1 \ldots w_{n-b-1}$ should be replaced (resulting in a net
reduction in the number of redundant paths). This can,
in general, be achieved by choosing the shuffle
connection preceding the stage appropriately.


Other than the above conditions and the inclusion of 
the right number of r-bits and a minimum number of 
d-bits at each stage, we have a lot of freedom in choos- 
ing the sizes of switches at each stage. 


Consider, as an example of the above procedure, the
construction of a 4-path 16×16 Omega Network with
the redundancy graph shown in Fig. 4. We have $r=2$
(redundant bits $r_0$ and $r_1$) and $n=4$ (destination address
bits $d_0$, $d_1$, $d_2$, and $d_3$). Since stages 1 and 2 involve a
binary fork each, $r_0$ has to be a part of stage 1 and $r_1$
of stage 2. Stages 3 and 4 involve a join each and hence
one r-bit each should be replaced by d-bits in these
stages; at stage 3, the r-bit introduced in stage 2 (in the
terminal subwindow) should be replaced and at stage 4
the r-bit introduced in stage 1 should be replaced. There
are many ways to distribute the remaining two d-bits
among the four stages. Consider the following
distribution as an illustration:


Stage:      1          2      3      4
Tag bits:   r_0 d_0    r_1    d_1    d_2 d_3


This would correspond to using 4×4 switches in the first
and last stages and 2×2 switches in the intermediate
stages. Let us determine the connections to precede each
stage to realize the above redundancy graph. A 4*4
shuffle preceding stage 1 would give $W_1 = s_2 s_3 r_0 d_0$. Thus
at the output of this stage, we have two alternate
terminals ($s_2 s_3 0 d_0$ and $s_2 s_3 1 d_0$) that a path could occupy to
get to the same destination. A 2*8 shuffle preceding
stage 2 will make $W_2 = s_3 r_0 d_0 r_1$. Now the redundancy is
increased to four, with the four paths occupying two
different switches ($s_3 0 d_0$ and $s_3 1 d_0$). In stage 3, $r_1$ must
be replaced by $d_1$ (to ensure the join and parallel edges);
hence stage 3 is preceded by the identity connection,
making $W_3 = s_3 r_0 d_0 d_1$. Stage 4 is just preceded by the
4*4 shuffle, leading to $W_4 = d_0 d_1 d_2 d_3$, the correct
destination. This results in the network shown in Fig. 4.
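
The window sequence of this example can be reproduced symbolically. The sketch below is only an illustration: it assumes the source bits are labelled s0 to s3 (zero-indexed, as in (1)), models each shuffle as a left rotation of the window, and lets the tag bits of a stage overwrite the least significant positions. The helper `next_window` is hypothetical.

```python
def next_window(window, rotate, tag):
    """Rotate the window left by `rotate` positions (the shuffle), then
    let the stage's tag bits replace the least significant positions."""
    rotated = window[rotate:] + window[:rotate]
    return rotated[:len(rotated) - len(tag)] + list(tag)


# (left rotation applied by the preceding connection, tag bits of the stage)
stages = [
    (2, ["r0", "d0"]),   # 4*4 shuffle, 4x4 switches
    (1, ["r1"]),         # 2*8 shuffle, 2x2 switches
    (0, ["d1"]),         # identity connection, 2x2 switches
    (2, ["d2", "d3"]),   # 4*4 shuffle, 4x4 switches
]

w = ["s0", "s1", "s2", "s3"]
history = []
for rotate, tag in stages:
    w = next_window(w, rotate, tag)
    history.append(w)
```

After the loop, `history` holds the four windows in stage order, and the last one contains only destination bits.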


3. IMPLEMENTATION SCHEMES FOR
DISJOINT PATH NETWORKS


Disjoint path networks have been dealt with in detail
in [7]. They provide the highest tolerance to faults
(among all multiple path networks): an R-path network
of this type can tolerate (R-1) arbitrary internal stage
faults and potentially many more. The set of internal
switches and links in such a network can be divided into
R disjoint classes; a path from a source to a destination
consists of (internal) switches and links from only one
such class. In addition, the path has counterparts in each
of the other R-1 classes. Classes containing no faults in
essence support the paths that would encounter faults in
the other classes.


A modular organization of a B×B crosspoint switch
is shown in Fig. 5a. A connection is established in the
switch when an input module is connected to an output
module. When a request is made to an input module $i$
along with a base-B address $j$, the input module $i$
requests output module $j$ for connection; if the output
module is not currently in use (and if it is not faulty, a
situation we will consider shortly), such a connection can be
set up. Following this setup, direct connections are available
between input $i$ and output $j$ in the switch for all
control and data signals. The technique to set up a
sequence of such switches is the set-and-forward scheme.
In cycle $2i-1$, stage $i$ is set up and in cycle $2i$, the
address for stage $i+1$ is passed on to the next stage by
stage $i$. In $2S$ cycles, the entire path is set up (if no
exceptions arise in between).


Two exceptions could arise in the process of setting 
up such a path - a block, and a fault. A block is sig- 
nalled first by the BL line in the switch where the block 
occurred, and then propagated back to the source. Once 
a fault is detected, generation and propagation of the F 
signal takes place in a manner similar to that of the 
block signal.


3.1 Fault Recognition


Most published research on fault analysis in networks
([4], [10]) has used as the model of a fault an entire
switch being stuck at one of its states. We outline here a
more realistic model used in [8]. We shall consider each
module along with the lines leaving it as our units, in
the sense that we shall not be concerned with the logic
within the blocks (Fig. 5c). Such a unit will be
considered to be faulty if any value of the input vector does
not produce an appropriate output vector.


Such a general class of faults can be detected by a
replication check, where we duplicate or triplicate every
module in the switch. (See [6] for such a design.) We
propose a self-checking scheme inherent in our protocol,
which is less versatile than replication, but requires little
increase in hardware. The purpose of every control
signal is to exercise a portion of the logic in the receiving
module. If a plausible response is elicited for an output 
signal (resulting in a proper handshake as specified in 
Fig. 5b), it is reasonable to expect that the portion of 
the logic exercised by that signal works properly. Con- 
sider Fig. 5c. When an output module B initiates a 
handshake by asserting the REQ line, it monitors the 
three input lines ACK, BL, and F. If none, or more than 
one of these is asserted, the unit C is assumed to be faul- 
ty. If B did not assert the REQ line properly as it should 
have, it will not find any input signal asserted and will 
know of a fault. (Note that the return-to-zero protocol 
detects all line stuck-at faults.) Thus each module checks 
a portion of the logic in the next module as a path gets 
set up. When a fault is detected, two actions are taken 
by the detecting module. It makes a note of this by set- 
ting a fault flag within the module. And it signals the 
fault back by asserting the F line which propagates back 
to the source. By setting the fault flag in the module, 
subsequent requests to the module are notified immedi- 
ately of the fault. (Note that the state-updating portion 
of the logic in the second module is not exercised by 
handshaking.) 
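
The check an output module applies after asserting REQ can be summarized in a small sketch. It follows the rule stated above (exactly one of ACK, BL, and F must be asserted); the function name and the returned labels are our own illustration, not part of the protocol in Fig. 5b.

```python
def classify_response(ack: bool, bl: bool, f: bool) -> str:
    """Interpret the three return lines seen after REQ is asserted.

    Exactly one asserted line is a plausible response; none or more
    than one means the next unit is assumed to be faulty.
    """
    asserted = [name for name, level in (("ACK", ack), ("BL", bl), ("F", f)) if level]
    if len(asserted) != 1:
        # The detecting module would set its fault flag and assert F
        # back toward the source.
        return "FAULT_DETECTED"
    return {
        "ACK": "CONNECTED",
        "BL": "BLOCKED",
        "F": "FAULT_REPORTED",
    }[asserted[0]]
```

A response with no line asserted also covers the case where REQ itself failed to leave the requesting module.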


We have considered here only the control portion of 
each module. To ensure that data (which includes ad- 
dress and data bits) is transferred properly from module 
to module, an encoding scheme can be used. 


3.2 Fault Notification 


For the purposes of this section it is convenient to
consider each sequence of the form output
module-link-input module as a single entity called an
element. The reasons for this are twofold. First, faults
anywhere within such an element affect the operation of
the network in the same manner. Second, the terminal
an input-output path occupies in a stage is given by such
an element. The network, in this view, consists of S+1
stages of elements. There are several options for
notification of faults back to the sources and the
subsequent action to be taken.


Non-adaptive Routing: In this approach, each path 
learns of a fault only when it reaches the faulty element. 
When the fault signal reaches the source, it sends the 
same request out on the next alternate path, without re- 
taining knowledge of the presence of the fault along the 
path just tried. Advantages of this scheme are that non- 
faulty paths will be utilized to the maximum extent and 
that there is almost no additional hardware required. 
The drawback of the approach is that a long time could 


be spent trying alternate paths (especially if the faults 
are located in the later stages of the network). 


Adaptive Routing - Notification on Demand: In 
this scheme, knowledge of the location of the faults en- 
countered in various tries is maintained at the sources 
and is used in subsequent routing decisions. A source 
learns of a faulty element upon requesting it the first 
time. It can determine, by keeping count of the number 
of clock cycles after which the fault signal is received, 
where along the path (the stage and element) the fault is 
located. The ID’s of the faults thus determined can be 
stored in a set of B tables associated with the B termi- 
nals that a source could branch into at the first stage. 
Trying a path now entails first checking the appropriate 
table to see if the path would encounter any of the faults 
currently in the table. 


Adaptive Routing - Broadcast Notification: This is
a conceptual scheme that would be difficult to
implement in practice. However, the network performance
under this scheme will provide us a standard to evaluate
the previous proposals. Under the broadcast scheme,
notification of a fault is done immediately upon
recognition (or occurrence) to all the sources that could
potentially use that element. Once the fault ID's reach the
sources, they are handled in exactly the same manner as
in the demand notification scheme.


In the two adaptive schemes above, the size of the 
fault tables maintained at the sources is an important 
parameter. When a table of size T is maintained, a
terminal will have to be shut down when more than T
faults are reported at that terminal. A larger T increases 
not only the logic complexity and the storage required, 
but also the initial overhead associated with searching it. 
Reducing the table size would close down terminals fas- 
ter leading to increased traffic along fewer paths and ear- 
lier shutdown of the system. The effect of T on the per- 
formance is considered in the next section. 
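
One way to picture the fault table kept for each first-stage terminal is the sketch below. It is an assumption-laden illustration: the class name, the use of (stage, element) pairs as fault ID's, and the representation of a path as a list of such pairs are all hypothetical.

```python
class TerminalFaultTable:
    """Fault table of capacity T for one first-stage terminal."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.faults = set()
        self.shut_down = False

    def report(self, fault_id) -> None:
        """Record a fault; shut the terminal down once more than T are reported."""
        if self.shut_down or fault_id in self.faults:
            return
        if len(self.faults) >= self.capacity:
            self.shut_down = True
        else:
            self.faults.add(fault_id)

    def usable(self, path_elements) -> bool:
        """A path may be tried if the terminal is open and no known fault lies on it."""
        return not self.shut_down and self.faults.isdisjoint(path_elements)


table = TerminalFaultTable(capacity=1)
table.report((2, 5))                      # hypothetical (stage, element) ID
still_ok = table.usable([(1, 0), (2, 4)])
avoided = table.usable([(1, 0), (2, 5)])
table.report((3, 1))                      # second distinct fault exceeds T = 1
```

With capacity 0 the very first reported fault closes the terminal, which is the early-shutdown behaviour of the T=0 option.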


3.3 Some Evaluation Studies 


Fig. 6 presents the average normalized delay in a 
64X64 network as a function of the percentage of faulty 
elements, for a variety of cases. Normalized delay is 
defined as the ratio of the actual delay to the minimum 
delay. (Faults are permitted only in the internal stages 
of the network.) The request rate m is the probability of 
a source making a new request in a cycle when it has no 
outstanding request; requests are not allowed to be 
queued at the sources. The 2-path network is
constructed using seven stages of 2×2 switches; the 4-path
network uses four stages of 4×4 switches.


Let us briefly account for the shape of the curves first. 
Recall that the set of intermediate elements can be di- 
vided into R classes and that the non-faulty classes sup- 
port the paths that will encounter the faulty elements. 
As the number of faults increases, so does the number of 
paths that a non-faulty class supports. Effectively then, 
the load on the non-faulty paths is much higher with a 
higher redundancy. This accounts for the higher delay of 
the 4-path networks as F gets higher. Breakdown of the 
system is said to occur when all R paths from some 
source to some destination contain faults. To obtain the 
numbers in Fig. 6, we have kept the system running 
after such breakdowns. 


The non-adaptive scheme performs very close to the
adaptive scheme and does better than the latter with
very small table sizes because it uses non-faulty
elements to the maximum possible extent. The T=0
option in the adaptive routing scheme closes down
nals (or classes) much earlier than necessary leading to 
increased load on the rest of the alternates. This overkill 
results in it performing worse than the non-adaptive 
scheme. 


4. CONCLUSION 


We have, in this paper, introduced schemes for
fault-tolerance in shuffle-exchange type networks based on
redundant paths between inputs and outputs. The
interaction of the multiple paths between each source and
destination is specified by means of a redundancy graph,
and the derivation of a network with a given redundancy
graph is discussed. We have also considered some
practical schemes for using the multiple paths and
shown that some inexpensive schemes provide good
performance in the presence of faults.


Abstract


An NxN (N inputs and outputs) multistage in- 
terconnection network is said to be rearrangeable 
if it can realise all the possible connections of 
the N input terminals to the N output terminals in 
a one-to-one fashion. Starting with the pioneer- 
ing works of Clos and Benes to this date a variety 
of sufficient conditions for rearrangeability 
are known in the literature. In this paper it is 
shown that the well known sufficient condition due 
to Benes on the link permutation is also necessary 
for rearrangeability if the network is made up of 
2x2 switches. 


Introduction 


An NxN switching network (with N inputs and N 
outputs) is an arrangement of switches which is 
capable of performing certain permutations of
inputs. The combinatorial power (CP) of a given
switching network is often measured [2] as the
ratio of the number of permutations the network can
realise to the total number of possible
permutations of N inputs. Clearly $0 < CP \le 1$. A switching
network is said to be rearrangeable if CP=1, that 
is, there exist settings of its component switches 
such that, the switching network as a whole, can 
realise all the N! permutations. The well known 
NxN cross-bar switch has CP=1. Most of the early 
studies on (rearrangeable) switching networks have 
been exclusively in the context of telephone net- 
works [1][2][6]. The interest in multistage (re- 
arrangeable) switching networks revived recently 
in connection with the development of parallel 
computers and parallel algorithms [4][13]. At the 
present time there is considerable interest in the 
interconnection network as evidenced by the special 
issue on this topic [8] as well as the extensive
bibliography in [6][12].

Clos in 1953 [1], in a fundamental paper,
exhibited for the first time a three stage switching
network which is rearrangeable. Benes in a series
of papers analysed a rich class of switching
networks from the point of view of rearrangeability.
Among many other interesting results, Benes gave a
method for the design of a multistage switching
network made up of an odd number of stages and with
square (cross-bar) switches. We call this class
of networks the general Benes class (GBC). The
contributions by Benes are succinctly summarised
in his now classic book [2]. Most of our
notations follow those of Benes [2]. Consider the
following subclass of the GBC defined recursively
using 2x2 switches. For easy reference, we call
this subclass the class of block structured
networks (BSN). Let $N = 2^n$.

Definition 1: An NxN block structured network is
made of three (3) stages. The first and the third
stages are each made of 2x2 switches and there are
N/2 of them in each of these stages. The middle
stage, however, has two copies of N/2 x N/2 block
structured networks called blocks A and B. The N
outputs of stage i are connected to N inputs of
stage i+1 by an interconnection scheme $\phi_i$, called
the link permutation, i=1,2. Referring to figure
1, let $X_1, X_2, \ldots, X_{N/2}$ and $Y_1, Y_2, \ldots, Y_{N/2}$
denote the 2x2 switches at the first and third stages.
Notice 2i-1 and 2i are the inputs of $X_i$ and 2j-1
and 2j are the outputs of $Y_j$. In the following we
define a special class of link permutations.
Definition 2: The link permutation $\phi_1$ ($\phi_2$) in the
block structured network is said to be
distributive if the two outputs (inputs) of each of the
switches $X_i$ ($Y_j$) are connected one to each of the two
blocks A and B at the input (output). Similarly,
when the blocks are further reduced to three stage
networks of proper sizes, the link permutations in
each of these reductions could be required to
satisfy the above condition of distributivity.
From theorem 3.9 of Benes [2], it readily follows 
that a block structured network is rearrangeable 
if all the link permutations are distributive. In 
this paper, we prove that distributivity of all 
the link permutations is indeed necessary for the 
class of block structured networks to be rear- 
rangeable. 


Main Result 


Our main result is the contents of the
following:


Theorem: A block structured network is rearrangeable
if and only if all the link permutations are
distributive.

Proof: The if part follows from theorem 3.9
of Benes [2]. To prove the only if part, first
assume that $\phi_1$ is not distributive but $\phi_2$ is.
Referring to figure 2, let both the outputs of $X_i$
be connected to the block A. Now any permutation
which takes both the inputs to the switch $X_i$ to
the outputs of a single switch $Y_k$ (say) for any
k=1,2,...,N/2 is not realisable by the overall
network. A similar argument follows if $\phi_2$, or $\phi_1$ and
$\phi_2$, are not distributive [15].

The above analysis clearly establishes the fact
that for the rearrangeability of the overall
network it is necessary that $\phi_1$ and $\phi_2$ be
distributive. Now to show that all the link permutations
that are embedded in the blocks A and B must also
be distributive, assume without loss of
generality that there is a link permutation which is part of
the block A that is not distributive. Then, by the
inductive hypothesis, it follows that there exists
a permutation $\pi$ (on N/2 objects) which is not
realisable by block A. Since $\phi_1$ is distributive,
there exists a subassignment $\psi_1$ of the switches $X_i$,
i=1 to N/2, to the inputs of block A, which are
numbered from 1 to N/2. That is, referring to figure
3, $\psi_1(X_i) = j$ if an output of the switch $X_i$ is
connected to input j of block A under $\phi_1$. Similarly,
$\psi_2(r) = Y_k$ if the output terminal r of block A is
connected to an input of the switch $Y_k$ under $\phi_2$.
Define
$p_A : \{X_s \mid s = 1 \text{ to } N/2\} \to \{Y_r \mid r = 1 \text{ to } N/2\}$ where
$p_A = \psi_2 \cdot \pi \cdot \psi_1$. Consider a permutation p (on N
objects) which takes the two inputs to switch $X_i$ to
the two outputs of $Y_j$ where $p_A(X_i) = Y_j$. This
permutation p is clearly not realisable by the
overall network since $\pi$ is not realisable by block A.
The proof that distributivity is necessary when
n=2 (the basis for the induction) is very similar
to the one presented above. Hence the theorem.
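
The theorem can be checked by brute force in the smallest case. The sketch below assumes N = 4, so that each middle block is a single 2x2 switch, numbers terminals from 0, and represents a link permutation as a list mapping an output position of one stage to an input position of the next; all function names are hypothetical.

```python
from itertools import product


def realised(phi1, phi2):
    """All permutations a 4x4 three-stage network of 2x2 switches can realise."""
    perms = set()
    for setting in product([0, 1], repeat=6):   # 0 = straight, 1 = cross
        outputs = []
        for x in range(4):
            p = x ^ setting[x // 2]             # first stage (X switches)
            p = phi1[p]
            p ^= setting[2 + p // 2]            # middle stage (blocks A, B)
            p = phi2[p]
            p ^= setting[4 + p // 2]            # third stage (Y switches)
            outputs.append(p)
        perms.add(tuple(outputs))
    return perms


def distributive_in(phi1):
    """Each first-stage switch sends one output to each block."""
    return all({phi1[2 * i] // 2, phi1[2 * i + 1] // 2} == {0, 1} for i in range(2))


def distributive_out(phi2):
    """Each third-stage switch receives one input from each block."""
    sources = {}
    for p, q in enumerate(phi2):
        sources.setdefault(q // 2, set()).add(p // 2)
    return all(blocks == {0, 1} for blocks in sources.values())


good = realised([0, 2, 1, 3], [0, 2, 1, 3])     # both links distributive
bad = realised([0, 1, 2, 3], [0, 2, 1, 3])      # phi1 sends X outputs to one block
cp_good = len(good) / 24
cp_bad = len(bad) / 24
```

With distributive links the combinatorial power is 1; with the non-distributive first link the first two stages collapse into one, so some of the 24 permutations are lost.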


COMMENTS 


1. In all the examples of the Benes network
given in the literature [3][9][10], it is assumed
that $\phi_2 = \phi_1^{-1}$ and $\phi_1$ is distributive. Our result
shows that one can choose $\phi_1$ and $\phi_2$ independently
so long as they are distributive.

2. One of the interesting open questions in 
the context of the shuffle-exchange-type networks 
is whether or not two passes of the Omega network 
is rearrangeable [14]. Looking at this problem 
from the point of view of necessary conditions, 
we found out, for N=8, by simply rewriting the 
network and fixing the states of certain switches 
as in figure 4, that two passes of Omega network 
is rearrangeable. In particular, the first and 
third switches in the 4th stage of figure 4 are 
set to the "straight" (S) state but the second and 
fourth switches are set to the "cross" (C) state. 
If the state of a switch is fixed, it can as well 
be replaced by permanent connections. The link 
permutation between stages 3 and 5 resulting from 
this elimination of switches in the 4th stage is 
shown in figure 5. The rest of the stages 1,2,3, 
5, and 6 as well as the inputs to the stage 1 in 
figure 5 are all obtained by permuting the switch- 
es in figure 4. A closer examination of figure 5 
reveals that it is a block structured network 
where all the link permutations are distributive. 
Hence, by our theorem it is rearrangeable. Since 
rearrangeable networks remain rearrangeable even 
if the inputs are permuted, the input terminals 
in figure 5 without loss of generality can be re- 
numbered in the natural order 1 through 8 instead 
of as shown in figure 5. It is interesting to 
note that while Parker [14] has also arrived at
the same conclusion for N=8, he uses a brute-force
enumeration method instead.

Figure 4. Two passes of Omega network


A FAST ALGORITHM FOR CONCURRENT LU DECOMPOSITION AND MATRIX INVERSION*


Ming-Yang Chern and Tadao Murata


Abstract -- This paper presents an efficient
algorithm for LU decomposition and matrix
inversion based on the concurrent data-loading array
architecture. The algorithm performs the LU
decomposition of a strongly nonsingular matrix $A$
initially loaded in the array, in parallel with
computing the inverse matrices $L^{-1}$, $U^{-1}$, and $A^{-1}$;
$L^{-1}$ and $U^{-1}$ can be taken out together with $L$
and $U$, and $A^{-1}$ will appear in the array at the end
of the computation. For an $n \times n$ matrix executed
on the array of the same size, the total time
required for the above computation is $n(t_a + t_m + t_d)$,
where $t_a$, $t_m$, and $t_d$ represent the time for
addition, multiplication, and division, respectively.
A simple augmentation of this algorithm can lead
to the solution of a linear system of equations
without additional time. The performance of this
algorithm is analyzed and compared favorably with
an improved version of systolic arrays.


I. Introduction


Many scientific and engineering problems can
be reduced to the problem of solving linear
systems of equations (LSEs). The recent availability
of low cost, high density, fast VLSI devices has
opened a new avenue for using processor arrays to
perform special-purpose parallel computations.
Efficient algorithms and cost-effective hardware
structures have been intensively searched. The
VLSI computing structures related to solving LSEs
have been suggested by several researchers. Kung
[3,4] proposed systolic arrays which can be used
for LU decomposition, matrix multiplication, and
linear convolution. Preparata and Vuillemin [7]
analyzed an implementation of triangular matrix
inversion from small construction modules. Hwang
and Cheng [5] presented a complete set of
computational structures for solving LSEs. This set
includes some special modules for LU
decomposition, solving a triangularized LSE, triangularized
matrix, and matrix multiplication.


In Hwang and Cheng's design, the modulized
construction units can also be integrated to
LU-decompose large-scale matrices as reported in
their partitioned matrix algorithms [6]. These
modules and linear equation solvers were recently
applied to a system for image processing and
database management [9]. The usefulness of these
modules is evident. However, the utilization of
processing elements (PEs) in these modules is
still low. Moreover, these modules, like other
systolic arrays, suffer from the data reshaping
problem. The data of a dense matrix must be
reshaped before being fed into the processor
array. The scheme of using on-chip delay latches
can solve the problem partially, but evokes a more
severe problem in the array chip interconnection.


The process of solving LSEs in the above
designs is divided into several steps, where a
different hardware module is used in each step.
Some of these steps cannot be executed
concurrently. For example, the computation of $U^{-1}$
cannot start unless the LU decomposition is
completed. In addition, the output data of a step
may not be arranged in the same way as required by
the input of the next computational module. Some
special provision must be arranged. This may need
extra hardware and cause extra time delay.
According to [6], four types of VLSI arithmetic
module chips are required. Four is not a large
number. However, it is still desirable to use
fewer types of module chips.


* This work was supported by the National Science
Foundation under Grant ECS 81-05649.


This paper presents a highly efficient algorithm based on a Concurrent Data-Loading Array Processor (CDLAP) [2]. Only one type of module chip is required for the construction of LSE solvers. According to our algorithm, the CDLAP can perform LU decomposition a little faster than Hwang and Cheng's design [5], and the inverse matrices of U, L, and A can all be obtained in the same process. Further, the solution vector or matrix of an LSE can be concurrently computed on an associated CDLAP array.


For LU decomposition of an $n \times n$ matrix $A = [a_{ij}]$, we consider only the case in which all the principal minor submatrices of A are nonsingular. This provides a necessary and sufficient condition to produce a unique lower triangular matrix $L = [l_{ij}]$ with all $l_{ii} = 1$, and a unique upper triangular matrix $U = [u_{ij}]$ such that L * U = A. Both L and U are $n \times n$ nonsingular matrices. In Crout's reduction method [10], the matrix A = L * U is decomposed according to the following computations:


For $i = 1, 2, \ldots, n$:

    $u_{ik} = a_{ik} - \sum_{j=1}^{i-1} l_{ij} u_{jk}$   for $i \le k \le n$;

    $l_{ki} = \left( a_{ki} - \sum_{j=1}^{i-1} l_{kj} u_{ji} \right) / u_{ii}$   for $i < k \le n$.   (1)


Equation (1) can be transformed into another form more suited for parallel computations. We use $a_{ij}^{(k)}$ to denote the $a_{ij}$ value in the kth recursion of computation. Set $a_{ij}^{(1)} = a_{ij}$ at the beginning. Using "t" as a variable to denote the recursion sequence from 1 to n, we have the following algorithm equivalent to (1).

For $t = 1, 2, \ldots, n$:

    $u_{tj} = a_{tj}^{(t)}$   for $j \ge t$   (2.a)

    $l_{it} = a_{it}^{(t)} / u_{tt}$   for $i \ge t$   (2.b)

    $a_{ij}^{(t+1)} = a_{ij}^{(t)} - l_{it} u_{tj}$   for $i, j > t$   (2.c)

In each recursion, the above three equations are executed in the order (2.a), (2.b), and (2.c). (The same convention will apply to the equations (4), (6), (8), (10), and (12) in the following.) When t equals n, the above computation ends after computing (2.b). Note that the computation of $l_{it}$ in (2.b) is not necessary for i = t, since $l_{tt}$ is always equal to 1.
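The recursion (2.a) to (2.c) can be sketched sequentially in a few lines of Python. This is only an illustrative serial model, not the array implementation: the function name `lu_recursive` and the list-of-lists storage are assumptions made for the example, indices are 0-based, and no pivoting is done (the text assumes all principal minor submatrices are nonsingular).

```python
def lu_recursive(a):
    """Serial sketch of recursions (2.a)-(2.c); returns (L, U) with unit-diagonal L."""
    n = len(a)
    a = [row[:] for row in a]            # a[i][j] holds a_ij^(t); copy keeps the input intact
    l = [[0.0] * n for _ in range(n)]
    u = [[0.0] * n for _ in range(n)]
    for t in range(n):                   # 0-based t; the text uses t = 1, ..., n
        for j in range(t, n):            # (2.a): u_tj = a_tj^(t) for j >= t
            u[t][j] = a[t][j]
        for i in range(t, n):            # (2.b): l_it = a_it^(t) / u_tt for i >= t
            l[i][t] = a[i][t] / u[t][t]
        for i in range(t + 1, n):        # (2.c): a_ij^(t+1) = a_ij^(t) - l_it * u_tj for i, j > t
            for j in range(t + 1, n):
                a[i][j] -= l[i][t] * u[t][j]
    return l, u


L, U = lu_recursive([[4.0, 3.0], [6.0, 3.0]])
```

On the array, all updates of (2.c) within one recursion happen at once; the nested loops above only serialize them.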


To illustrate the computation of the inverse matrices of L and U, let $L^{-1} = M = [m_{ij}]$ and $U^{-1} = V = [v_{ij}]$. Since L and U are nonsingular, both M and V exist and are unique. It is easily verified that M is an n x n lower triangular matrix with all its diagonal elements equal to 1, just like L. Similarly, V is an n x n upper triangular matrix like U. To compute M, we make use of the relation L * M = I, where I is the identity matrix of order n. Thus we may write

    $m_{kk} = 1$   for $k = 1, 2, \ldots, n$;

    $m_{ij} = - \sum_{k=j}^{i-1} l_{ik} m_{kj}$   for $1 \le j < i \le n$.   (3)

As in the previous case, (3) can be transformed into the following recursive algorithm:

For $t = 1, 2, \ldots, n$:

    $m_{tt} = m_{tt}^{(t)} = 1$;  $m_{it}^{(t)} = 0$ for $i > t$;  $m_{tj} = m_{tj}^{(t)}$ for $j < t$   (4.a)

    $m_{ij}^{(t+1)} = m_{ij}^{(t)} - l_{it} m_{tj}$   for $i > t$, $j \le t$   (4.c)

Note that (4.b) is missing. The labeling (4.c) is used instead of (4.b) for the convenience of later illustration. The computation of V based on the relation V * U = I can be written as:


    $v_{kk} = 1 / u_{kk}$   for $k = 1, 2, \ldots, n$;

    $v_{ij} = - \left( \sum_{k=i}^{j-1} v_{ik} u_{kj} \right) / u_{jj}$   for $1 \le i < j \le n$,   (5)

which in turn can be transformed into the following recursive algorithm:

For $t = 1, 2, \ldots, n$:

    $v_{tt}^{(t)} = 1$;  $v_{tj}^{(t)} = 0$ for $j > t$   (6.a)

    $v_{it} = v_{it}^{(t)} / u_{tt}$   for $i \le t$   (6.b)

    $v_{ij}^{(t+1)} = v_{ij}^{(t)} - v_{it} u_{tj}$   for $i \le t$, $j > t$   (6.c)

When t equals n, the inversions of L and U will be completed after computing the (a), (b) portions of the above equations.


To describe the inversion of matrix A, let $S = [s_{ij}] = A^{-1}$. Since A = L * U, we have $S = U^{-1} * L^{-1} = V * M$. Thus, for any i and j,

    $s_{ij} = \sum_{k=1}^{n} v_{ik} m_{kj} = \sum_{k=\max(i,j)}^{n} v_{ik} m_{kj}$.   (7)

Transforming (7) into the recursive algorithm, we have:

For $t = 1, 2, \ldots, n$:

    $s_{it}^{(t)} = s_{tj}^{(t)} = 0$   for $i, j \le t$   (8.a)

    $s_{ij}^{(t+1)} = s_{ij}^{(t)} - v_{it} m_{tj}$   for $i, j \le t$   (8.c)

After the above n recursions, set

    $s_{ij} = - s_{ij}^{(n+1)}$.   (8.d)

Step (8.d) at t = n+1 is added, since $s_{ij}^{(t)}$ at t = n+1 is equal to $-s_{ij}$.
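As a rough check of recursions (4), (6), and (8), the following serial Python sketch computes $M = L^{-1}$, $V = U^{-1}$, and $S = A^{-1}$ from given factors L and U. The function name `invert_on_array` and the in-place list storage are assumptions for illustration only. Each position (i, j) performs exactly one multiply-and-subtract per recursion, which mirrors how the updates are distributed over the array; the region i, j > t belongs to (2.c) and is omitted here because L and U are taken as inputs.

```python
def invert_on_array(l, u):
    """Serial sketch of (4.a),(4.c), (6.a)-(6.c), (8.a),(8.c),(8.d); returns (M, V, A_inverse)."""
    n = len(l)
    m = [[0.0] * n for _ in range(n)]    # zeros model m_it^(t) = 0 entering the array
    v = [[0.0] * n for _ in range(n)]    # zeros model v_tj^(t) = 0 entering the array
    s = [[0.0] * n for _ in range(n)]    # (8.a): every s entry enters as 0
    for t in range(n):                   # 0-based t
        m[t][t] = 1.0                    # (4.a): m_tt^(t) = 1
        v[t][t] = 1.0                    # (6.a): v_tt^(t) = 1
        for i in range(t + 1):           # (6.b): v_it = v_it^(t) / u_tt for i <= t
            v[i][t] /= u[t][t]
        for i in range(n):
            for j in range(n):
                if i > t and j <= t:     # (4.c)
                    m[i][j] -= l[i][t] * m[t][j]
                elif i <= t and j > t:   # (6.c)
                    v[i][j] -= v[i][t] * u[t][j]
                elif i <= t and j <= t:  # (8.c)
                    s[i][j] -= v[i][t] * m[t][j]
    a_inv = [[-x for x in row] for row in s]   # (8.d): final sign change
    return m, v, a_inv


M, V, A_inv = invert_on_array([[1.0, 0.0], [1.5, 1.0]], [[4.0, 3.0], [0.0, -1.5]])
```

The uniform "subtract a product" form is why the final sign change (8.d) is needed: the accumulated value is the negative of the sum in (7).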


III. Parallel Computation Architecture


The Concurrent Data-Loading Array Processor (CDLAP) is introduced in [2]. The architecture is most suited for computations in which processing elements (PEs) in the same row or same column share the same operand data, as in a column-row vector multiplication. By accumulating resultant values in the PEs, the array processor can be used for computations involving recursive algorithms. The CDLAP has been shown to have an excellent efficiency in performing matrix multiplications [2]. The one-dimensional case of the concurrent data-loading design is reported in [8].

A careful examination of the recursive algorithms in Section II reveals some important characteristics of the algorithms. Consider the following three groups of equations:

    {(2.a), (4.a), (6.a), (8.a)} --- group (a)
    {(2.b), (6.b)} --- group (b)
    {(2.c), (4.c), (6.c), (8.c)} --- group (c)

The equations in each of the above groups can be executed concurrently. For group (a) in each recursion, there are always 2n+1 new superscripted variables assigned. All these variables are set to zero except $m_{tt}^{(t)}$ and $v_{tt}^{(t)}$, which are both set to 1. In addition, the total number of the variables $u_{tj}$ and $m_{tj}$ (without superscript) assigned is n+1 for any t. For group (b), there are exactly n divisions sharing the same divisor (operand) $u_{tt}$. Group (c) has more complex sharing of operands. In the recursion of t = k, (n-k) x n and k x n PEs are required for concurrently executing equation sets {(2.c), (4.c)} and {(6.c), (8.c)}, respectively. Here each operand $l_{ik}$ or $v_{ik}$ will be shared by exactly n PEs. The same characteristics can be observed for the other combination of the equation sets; that is, n x (n-k) and n x k PEs are needed for the equation sets {(2.c), (6.c)} and {(4.c), (8.c)}, respectively. Again, each operand $u_{kj}$ or $m_{kj}$ will be shared by exactly n PEs.
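The counts quoted above can be tallied directly from the index ranges of the recursions. The short Python sketch below does this for a 1-based recursion index t; the function name `group_counts` and the breakdown in the comments are our reading of equations (2), (4), (6), and (8), not something stated in this form in the text.

```python
def group_counts(n, t):
    """Tally the work of recursion t (1-based) for an n x n matrix."""
    # Group (a): m_it^(t) for i >= t, v_tj^(t) for j >= t, and the new row/column of s.
    new_superscripted = (n - t + 1) + (n - t + 1) + (2 * t - 1)
    # Values without superscript leaving the array: u_tj for j >= t and m_tj for j <= t.
    output_values = (n - t + 1) + t
    # Group (b): l_it for i > t and v_it for i <= t, all divided by u_tt.
    divisions = (n - t) + t
    # Group (c): (n - t) x n PEs for {(2.c), (4.c)} plus t x n PEs for {(6.c), (8.c)}.
    busy_pes = (n - t) * n + t * n
    return new_superscripted, output_values, divisions, busy_pes


counts_for_n4 = [group_counts(4, t) for t in range(1, 5)]
```

Under this reading every recursion yields the same totals (2n+1, n+1, n, and n x n), which is consistent with the regular operation of the array.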


The CDLAP configuration suited for the above complex computation requirements is shown in Fig. 1 for the case n = 4. For the LU decomposition and matrix inversion of an n x n matrix A, the CDLAP must have n dividers and n x n adder/multipliers. The PEs in the same row can share the same operand and the PEs in the same column can share the other common operand. The shifts are in the diagonal direction to the upper-left PEs. At the upper boundary of this array, n+1 bus registers are provided. These registers are used to hold the data for output and for the concurrent data-loading in the corresponding columns. The dividers, represented by circles in Fig. 1, have similar bus registers for output and for the concurrent data loading in the corresponding rows. Hence, there are 2n+1 output channels in total. (One of them is not necessary, since it is always 1 for this computation.) At the bottom and right boundaries, there are a total of 2n+1 input channels. The two input channels that correspond to $m_{tt}^{(t)}$ and $v_{tt}^{(t)}$ will always be set to 1, and the remaining input channels will be set to zeros.


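In order to illustrate the computation process on the CDLAP, we may consider the case n = 4 without loss of generality. The recursive algorithms in Section II can be extended, grouped, and illustrated in the following detailed computation steps.

Step 0: Initial data loading of matrix A:
    $a_{ij}^{(1)} = a_{ij}$   for all $i, j$.

Step 1: Initial shift:
    $u_{1j} = a_{1j}^{(1)}$   for $j = 1, \ldots, 4$;
    $m_{11} = m_{11}^{(1)} = 1$;  $m_{i1}^{(1)} = 0$   for $i > 1$;
    $v_{11}^{(1)} = 1$;  $v_{1j}^{(1)} = 0$   for $j > 1$;
    $s_{11}^{(1)} = 0$.

Step 2: Division:
    $l_{i1} = a_{i1}^{(1)} / u_{11}$   for $i > 1$;
    $v_{i1} = v_{i1}^{(1)} / u_{11}$   for $i = 1$.

Step 3: Multiplication and accumulation:
    $a_{ij}^{(2)} = a_{ij}^{(1)} - l_{i1} u_{1j}$   for $i, j > 1$;
    $m_{ij}^{(2)} = m_{ij}^{(1)} - l_{i1} m_{1j}$   for $i > 1$, $j = 1$;
    $v_{ij}^{(2)} = v_{ij}^{(1)} - v_{i1} u_{1j}$   for $i = 1$, $j > 1$;
    $s_{ij}^{(2)} = s_{ij}^{(1)} - v_{i1} m_{1j}$   for $i, j = 1$.

Step 4: Shift:
    $u_{2j} = a_{2j}^{(2)}$   for $j \ge 2$;
    $m_{22} = m_{22}^{(2)} = 1$;  $m_{i2}^{(2)} = 0$   for $i > 2$;
    $m_{2j} = m_{2j}^{(2)}$   for $j < 2$;
    $v_{22}^{(2)} = 1$;  $v_{2j}^{(2)} = 0$   for $j > 2$;
    $s_{i2}^{(2)} = s_{2j}^{(2)} = 0$   for $i, j \le 2$.

Step 5: Division:
    $l_{i2} = a_{i2}^{(2)} / u_{22}$   for $i > 2$;
    $v_{i2} = v_{i2}^{(2)} / u_{22}$   for $i \le 2$.

Step 6: Multiplication and accumulation:
    $a_{ij}^{(3)} = a_{ij}^{(2)} - l_{i2} u_{2j}$   for $i, j > 2$;
    $m_{ij}^{(3)} = m_{ij}^{(2)} - l_{i2} m_{2j}$   for $i > 2$, $j \le 2$;
    $v_{ij}^{(3)} = v_{ij}^{(2)} - v_{i2} u_{2j}$   for $i \le 2$, $j > 2$;
    $s_{ij}^{(3)} = s_{ij}^{(2)} - v_{i2} m_{2j}$   for $i, j \le 2$.

Step 7: Shift:
    $u_{3j} = a_{3j}^{(3)}$   for $j \ge 3$;
    $m_{33} = m_{33}^{(3)} = 1$;  $m_{i3}^{(3)} = 0$   for $i > 3$;
    $m_{3j} = m_{3j}^{(3)}$   for $j < 3$;
    $v_{33}^{(3)} = 1$;  $v_{3j}^{(3)} = 0$   for $j > 3$;
    $s_{i3}^{(3)} = s_{3j}^{(3)} = 0$   for $i, j \le 3$.

Step 8: Division:
    $l_{i3} = a_{i3}^{(3)} / u_{33}$   for $i > 3$;
    $v_{i3} = v_{i3}^{(3)} / u_{33}$   for $i \le 3$.

Step 9: Multiplication and accumulation:
    $a_{ij}^{(4)} = a_{ij}^{(3)} - l_{i3} u_{3j}$   for $i, j > 3$;
    $m_{ij}^{(4)} = m_{ij}^{(3)} - l_{i3} m_{3j}$   for $i > 3$, $j \le 3$;
    $v_{ij}^{(4)} = v_{ij}^{(3)} - v_{i3} u_{3j}$   for $i \le 3$, $j > 3$;
    $s_{ij}^{(4)} = s_{ij}^{(3)} - v_{i3} m_{3j}$   for $i, j \le 3$.

Step 10: Shift:
    $u_{4j} = a_{4j}^{(4)}$   for $j \ge 4$;
    $m_{44} = m_{44}^{(4)} = 1$;  $m_{4j} = m_{4j}^{(4)}$   for $j < 4$;
    $v_{44}^{(4)} = 1$;
    $s_{i4}^{(4)} = s_{4j}^{(4)} = 0$   for $i, j \le 4$.

Step 11: Division:
    $v_{i4} = v_{i4}^{(4)} / u_{44}$   for $i \le 4$.

Step 12: Multiplication and accumulation:
    $s_{ij}^{(5)} = s_{ij}^{(4)} - v_{i4} m_{4j}$   for $i, j \le 4$.

Step 13: Sign change:
    $s_{ij} = - s_{ij}^{(5)}$   for $i, j \le 4$.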
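The step numbering used for n = 4 follows a fixed pattern: one loading step, then a shift, a division, and a multiply-accumulate step per recursion, and a final sign change. A minimal Python sketch of that numbering for general n is given below; the function name `cdlap_schedule` and the label strings are illustrative assumptions.

```python
def cdlap_schedule(n):
    """Return the list of step labels; list index = step number."""
    steps = ["initial data loading"]                 # Step 0
    for t in range(1, n + 1):
        steps.append(f"shift (t={t})")               # Step 3t - 2, equation group (a)
        steps.append(f"division (t={t})")            # Step 3t - 1, equation group (b)
        steps.append(f"multiplication and accumulation (t={t})")   # Step 3t, group (c)
    steps.append("sign change")                      # Step 3n + 1
    return steps


for number, label in enumerate(cdlap_schedule(4)):
    print(f"Step {number}: {label}")
```

For n = 4 this reproduces Steps 0 to 13, with the multiply-accumulate steps at 3, 6, 9, and 12.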
According to the above computation steps, the operation of the CDLAP is repetitive and regular except for the initial data loading and the sign change in the last step. In each recursion, the three steps corresponding to equation groups (a), (b), and (c) are executed sequentially: shift first, then division, and then multiplication and accumulation. For each shift step, all data will be moved one position in the upper-left direction. The labeling of each datum shifting within the (n+1) x n PEs is not changed. The initial values of all variables $v_{ij}^{(t)}$, $m_{ij}^{(t)}$, and $s_{ij}^{(t)}$ are set to either 1 or 0 at the time these enter the array. On the other hand, the data $a_{tj}^{(t)}$ and $m_{tj}^{(t)}$ become $u_{tj}$ and $m_{tj}$ at the time they enter the output registers at the upper boundary. Similarly, after the division step, the $l_{it}$ and $v_{it}$ values are sent to the output registers. The data in the output registers are also used as the operands in the next step of multiplication and accumulation.


Fig. 2 shows the data flow of the above computation process. The four snapshots display the position of variables at the beginning of Steps 3, 6, 9, and 12, respectively. The L, U, $L^{-1}$, and $U^{-1}$ values can be obtained from the output ports, and the inverse matrix $A^{-1}$ will appear in the array at the end of the last step. Detailed input and output sequences are shown in Tables 1 and 2, where the last two columns will be referred to later in Section IV.


IV. Solving Linear Systems of Equations 
Consider a family of linear equations A * X = B, where $B = [b_{ij}]$ is a given n x m matrix and $X = [x_{ij}]$ is an n x m unknown matrix. When m = 1, B becomes a vector $b = [b_i]$ and X becomes an unknown vector $x = [x_i]$.

Since A = L * U, we have L * U * X = B. Let $U * X = D = [d_{ij}]$. Then we have the relation L * D = B. Using this relation, we can write

    $b_{ij} = \sum_{k=1}^{i} l_{ik} d_{kj}$.

Since $l_{ii} = 1$, we have

    $d_{ij} = b_{ij} - \sum_{k=1}^{i-1} l_{ik} d_{kj}$   for any i and j.   (9)

Equation (9) can then be transformed into the recursive algorithm shown below, starting from $b_{ij}^{(1)} = b_{ij}$.

For $t = 1, 2, \ldots, n$:

    $d_{tj} = b_{tj}^{(t)}$   (10.a)

    $b_{ij}^{(t+1)} = b_{ij}^{(t)} - l_{it} d_{tj}$   for $i > t$   (10.c)

This computation, similar to that for LU decomposition, will end at t = n after completing (10.a).

For U * X = D, the algorithm is similar. Using the relation $X = U^{-1} * D = V * D$, we can write

    $x_{ij} = \sum_{k=i}^{n} v_{ik} d_{kj}$   for any i and j.   (11)

Transforming (11) into a recursive form, we have:

For $t = 1, 2, \ldots, n$:

    $x_{tj}^{(t)} = 0$   (12.a)

    $x_{ij}^{(t+1)} = x_{ij}^{(t)} - v_{it} d_{tj}$   for $i \le t$   (12.c)

After completing (12.c) at t = n, the following sign change must be done to obtain the correct x value.

    $x_{ij} = - x_{ij}^{(n+1)}$   for all i and j.   (12.d)
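Recursions (10) and (12) can be checked with a short serial Python sketch that takes the L and V (that is, $U^{-1}$) values as they would arrive from the main array. The function name `solve_family` and the list-of-lists layout are assumptions for illustration; indices are 0-based.

```python
def solve_family(l, v, b):
    """Serial sketch of (10.a),(10.c),(12.a),(12.c),(12.d): returns X with A * X = B."""
    n, cols = len(b), len(b[0])
    b = [row[:] for row in b]                    # b[i][j] holds b_ij^(t)
    d = [[0.0] * cols for _ in range(n)]
    x = [[0.0] * cols for _ in range(n)]         # (12.a): x entries enter as 0
    for t in range(n):
        for j in range(cols):                    # (10.a): d_tj = b_tj^(t)
            d[t][j] = b[t][j]
        for i in range(t + 1, n):                # (10.c): rows below t
            for j in range(cols):
                b[i][j] -= l[i][t] * d[t][j]
        for i in range(t + 1):                   # (12.c): rows up to and including t
            for j in range(cols):
                x[i][j] -= v[i][t] * d[t][j]
    return [[-val for val in row] for row in x]  # (12.d): sign change


# A = [[4, 3], [6, 3]] has L = [[1, 0], [1.5, 1]] and U^{-1} = [[0.25, 0.5], [0, -2/3]].
X = solve_family([[1.0, 0.0], [1.5, 1.0]],
                 [[0.25, 0.5], [0.0, -2.0 / 3.0]],
                 [[7.0], [9.0]])
```

In each recursion a row of the associated array does either the (10.c) update or the (12.c) update, never both, which is why the same multiply-accumulate cell suffices.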
To perform the above computations, an extra n x m CDLAP will be needed. This extra array must be associated with the original array, from which the $l_{it}$ and $v_{it}$ data are transferred to the horizontal data buses of the associated array. The data of matrix B are initially loaded into this associated array and the shift must be in the "up" or "north" direction. The configuration can be implemented as shown in Fig. 3. The detailed steps for n = 4 are listed in the following (where $j = 1, 2, \ldots, m$):

Step 0: Data loading:
    $b_{ij}^{(1)} = b_{ij}$   for $i = 1, 2, 3, 4$.

Step 1: Initial shift:
    $d_{1j} = b_{1j}^{(1)}$;  $x_{1j}^{(1)} = 0$.

Step 3: Multiplication and accumulation:
    $b_{ij}^{(2)} = b_{ij}^{(1)} - l_{i1} d_{1j}$   for $i > 1$;
    $x_{ij}^{(2)} = x_{ij}^{(1)} - v_{i1} d_{1j}$   for $i \le 1$.

Step 4: Shift:
    $d_{2j} = b_{2j}^{(2)}$;  $x_{2j}^{(2)} = 0$.

Step 6: Multiplication and accumulation:
    $b_{ij}^{(3)} = b_{ij}^{(2)} - l_{i2} d_{2j}$   for $i > 2$;
    $x_{ij}^{(3)} = x_{ij}^{(2)} - v_{i2} d_{2j}$   for $i \le 2$.

Step 7: Shift:
    $d_{3j} = b_{3j}^{(3)}$;  $x_{3j}^{(3)} = 0$.

Step 9: Multiplication and accumulation:
    $b_{ij}^{(4)} = b_{ij}^{(3)} - l_{i3} d_{3j}$   for $i > 3$;
    $x_{ij}^{(4)} = x_{ij}^{(3)} - v_{i3} d_{3j}$   for $i \le 3$.

Step 10: Shift:
    $d_{4j} = b_{4j}^{(4)}$;  $x_{4j}^{(4)} = 0$.

Step 12: Multiplication and accumulation:
    $x_{ij}^{(5)} = x_{ij}^{(4)} - v_{i4} d_{4j}$   for $i \le 4$.

Step 13: Sign change:
    $x_{ij} = - x_{ij}^{(5)}$   for $i = 1, 2, 3, 4$.
The numbering of the above steps follows the system in Section III. Fig. 4 shows the data flow of this process in the associated array. The input and output sequences for the associated array are listed in Tables 1 and 2, respectively. The corresponding time units can be easily verified.


To solve A * x = b directly, the associated array is actually not necessary. The computation process is the same as LU decomposition except that the b data must be input through the channels $I_0$ to $I_3$ at Step 1. The snapshots of data flow for n = 4 are shown in Fig. 5. At the end of Step 13, the solution vector x is obtained in the array. This scheme offers an efficient way to solve LSEs (for m = 1) without extra hardware.


V. Estimation of Data Broadcast Delay


In analyzing the time required for one recursion of the above computations, it is necessary to consider the time required for setting up data signals on the data-loading lines before the data can be loaded. This data broadcast delay may be a controversial point about the CDLAP. However, the following estimates of this delay for a practical design indicate encouraging results. The physical parameters used in this estimation are as listed in Mead and Conway [1]:

                  Resistance           Capacitance
    Metal         0.03 ohms/square     0.3 x 10^-4 pf/µm^2
    Diffusion     10 ohms/square       1 x 10^-4 pf/µm^2
    Poly          15-100 ohms/square   0.4 x 10^-4 pf/µm^2
    Gate-channel                       4 x 10^-4 pf/µm^2


Assume that we have a VLSI array chip of 1.2 cm square, PEs of 1 mm square, metal lines 6 µm (3λ) wide, a metal line spacing of 6 µm, and 16 bits per word. The data-loading buses are mostly made of metal lines and partly connected by polysilicon or diffusion layer as shown in Fig. 6. Then the width of one data-loading channel would be about 200 µm or 0.2 mm. For the horizontal loading lines, there is a 0.2 mm diffusion link for every 1 mm of metal line. The diffusion link is assumed to have the same width as the metal line. Under this situation, the $R_l * C_l$ value of a vertical data line would be 0.13 nsec ($R_l$ = 60 ohms, $C_l$ = 2.2 pf), while for a horizontal line it is about 10 nsec ($R_l$ = 3400 ohms, $C_l$ = 3.0 pf). Here, $R_l$ is the resistance along the data line, and $C_l$ is the capacitance due to the data line itself. The $R_l * C_l$ value gives us a rough estimate of the intrinsic time delay due to the data bus itself; it is about two times the real time constant.


Two-layer metal lines were used in a recent work [11] in which a 32-bit processor was fabricated. This technique allows smaller chip area and easier layout cross-over. Using this technique in our implementation, both the horizontal and vertical loading buses will have the same high speed. Thus, the time delay is mainly caused by the output impedance of the driver circuit and the capacitance of the gates and non-metal pathways. For the layout shown in Fig. 6, the longest diffusion link needed to connect the data bus to the input buffer of a PE is about the layout width of the data bus, about 200 µm in our calculation above. Let Rg be the resistance along this diffusion link, and Cg be the capacitance due to the diffusion link and its associated gate area. Assume this diffusion link has a width of 4 µm (2λ). Then the maximum Rg is about 340 ohms. The maximum capacitance due to this diffusion link is 0.08 pf. If the area of the gate and other diffusion regions connecting to this link is within 200 µm^2 (which should be large enough), the Cg value for one PE would be less than 0.16 pf.


Taking a practical estimate, the output 
resistance of a driver circuit in the chip is 
assumed to be 1K ohms. The maximum resistance R 
to any gate area is thus 1000 + 60 + 340 = 1.4 K 
ohms. The total capacitance C associated with the 
single data line would be within (2.2 pf + 0.16 pf 
* 10) = 3.8 pf. Thus the R * C estimate of the 
time delay of the data bus is only 5.3 nsec. For 
the above matrix operations, the time required for 
one step of computation (one addition plus one 
multiplication, or one division) would be at least 
one order of magnitude longer than this delay. 
Therefore, the time required for the data broad- 
cast is negligible. 
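The arithmetic behind these estimates can be reproduced with a few lines of Python. The sketch below uses only figures quoted in the text (metal sheet resistance and capacitance per area, a 1.2 cm line 6 µm wide, and the stated resistance and capacitance totals); the function names are illustrative assumptions, and the Rg value of 340 ohms is taken as given rather than recomputed.

```python
OHMS_PER_SQUARE_METAL = 0.03     # metal sheet resistance quoted from Mead and Conway
PF_PER_UM2_METAL = 0.3e-4        # metal capacitance per square micron quoted in the text


def metal_line_rc(length_um, width_um):
    """Return (resistance in ohms, capacitance in pF) of a metal line."""
    resistance = OHMS_PER_SQUARE_METAL * length_um / width_um   # squares * ohms/square
    capacitance = PF_PER_UM2_METAL * length_um * width_um       # area * pF per area
    return resistance, capacitance


def rc_delay_ns(r_ohms, c_pf):
    """ohms * pF gives picoseconds; scale to nanoseconds."""
    return r_ohms * c_pf * 1e-3


# Vertical data line running the full 1.2 cm (12000 um) of the chip, 6 um wide.
r_line, c_line = metal_line_rc(12_000, 6)        # about 60 ohms and 2.2 pF

# Totals as stated in the text: driver + line + diffusion link; line + ten PE loads.
r_total = 1000 + 60 + 340                        # ohms
c_total = 2.2 + 0.16 * 10                        # pF
bus_delay_ns = rc_delay_ns(r_total, c_total)     # about 5.3 ns
```

The result is a product of a resistance and a capacitance only, so it is a coarse upper-level estimate in the same spirit as the text, not a circuit simulation.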


VI. Performance Analysis and Comparisons 


The advantages of the algorithm in this paper can be easily seen by comparing the required structural complexity and time efficiency with those of other array processors performing the same functions. The systolic algorithm for LU decomposition presented in [1] is not as efficient as that in [5]. In Kung's paper [12], only the net processing time is counted. The LU decomposition needs $n(t_a + t_m + t_d)$ processing time, while solving A * x = b (by back substitution) takes an additional $n(t_a + t_m + t_d)$ time, disregarding the data arrangement problems. Here, $t_a$, $t_m$, and $t_d$ represent the time for one addition, multiplication, and division, respectively.


A close examination of the CDLAP shown in Fig. 1 shows that the first two columns can be implemented as one column, since the dividers are not busy at the same time as the remaining PEs. On the other hand, the diagonal shift can be split into vertical and horizontal shifts, and these vertical or horizontal shift lines can be used for loading and unloading $n^2$ data in n shift steps without matrix data reshaping. Since the data shift is fast, the memory data rate is the most probable speed bottleneck; it must be at least fast enough to accommodate the I/O data in one unit of computation time. Therefore, the initial data loading of $n^2$ data would take less than n units of computation time.


With an appropriate provision of cache memory or buffer registers, the initial data loading time would be very short, just like the shift time in the case of systolic arrays. Thus we neglect the initial data loading time in our later analysis. The turnaround time required for LU decomposition in our design is $(n-1)(t_a + t_m + t_d)$. This is less than that in [12]. For solving A * x = b, our algorithm needs only $n(t_a + t_m + t_d)$, while that in [12] requires $2n(t_a + t_m + t_d)$ counting only the net processing time. Matrix inversion takes $n(2t_a + 3t_m + t_d)$ in [12], compared with only $n(t_a + t_m + t_d)$ turnaround time in our case.

For the convenience of analysis, we assume that $t_d = t_a + t_m$. It is also reasonable to assume that the time delay due to a shift, data broadcast, or sign change is negligible. A time unit can thus be defined as the time needed for a division or a multiplication plus an addition. The total time units spent up to each step are also shown in the 2nd column of Tables 1 and 2. From Table 2, it can be found that the computations of L, U, $L^{-1}$, $U^{-1}$, and $A^{-1}$ will take 2n-3, 2n-2, 2n-2, 2n-1, and 2n time units, respectively.


Hwang and Cheng's paper [5] offers a complete set of modules for constructing the LSE solvers. Their results are summarized in Table 3. Their turnaround times for obtaining $U^{-1}$ and $A^{-1}$ and for solving LSEs from A are also computed and listed in entries (8) to (11) of Table 3. Their computation of $U^{-1}$ has to wait until the completion of U, while for $A^{-1}$ the completion of $U^{-1}$ is required.


The performance of our algorithm and the required CDLAP structural complexity are also summarized in Table 3. The result of the comparison is self-explanatory. Our design uses less hardware and less time to obtain $L^{-1}$, $U^{-1}$, or $A^{-1}$, and to solve the LSEs. Further, only one type of module is required in our design, since the vertical/horizontal shift implementation of the CDLAP has the capacity of the associated array in Fig. 3.
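To make the comparison concrete, the turnaround formulas from entries (8) to (11) and (13) to (16) of Table 3 can be evaluated for a given n. The Python sketch below does only that; the function name `turnaround_units` and the dictionary keys are illustrative, and the CDLAP figures exclude the initial data loading, as in the table.

```python
def turnaround_units(n):
    """Turnaround time in time units: (design in [5], CDLAP after loading)."""
    return {
        "solve A*x = b": (6 * n - 3, 2 * n),                  # entries (8) and (16)
        "obtain U^-1 from A": (5 * n - 2, 2 * n - 1),         # entries (9) and (13)
        "obtain A^-1 from A": (9 * n - 4, 2 * n),             # entries (10) and (14)
        "solve a family of LSEs": (11 * n - 4, 2 * n),        # entries (11) and (15)
    }


for task, (modular, cdlap) in turnaround_units(4).items():
    print(f"{task}: {modular} vs {cdlap} time units")
```

For n = 4, for example, obtaining $A^{-1}$ is listed as 32 time units for the modular design against 8 for the CDLAP once the data are loaded.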


VII. Conclusions 


The algorithm presented in this paper offers an efficient way to perform LU decomposition and matrix inversions concurrently in the same array. For an n x n matrix, it requires only $n^2$ PEs. To the best of our knowledge, this algorithm uses fewer PEs and takes less time than existing algorithms on systolic arrays. In addition, an n x m unknown matrix of an LSE can be computed by attaching an n x m associated array. The computation of the unknown matrix can be completed in parallel with the above process. All of the above can be achieved using only one type of module, which is suitable for mass production.


The processor array on which the present algorithm is executed is a version of the CDLAP. A similar structure can be used for matrix multiplications [2]. In all the cases mentioned so far, no data reshaping problem is encountered for dense matrices. This architecture has shown very encouraging results for the applications considered so far. More applications are being investigated and will be reported later.


Fig. 1 CDLAP configuration for LU decomposition and matrix inversion, where squares, rectangles, and circles represent PEs, output registers, and dividers, respectively.

Fig. 2 The data flow on the CDLAP (of Fig. 1) for LU decomposition and matrix inversion.

Fig. 3 The CDLAP configuration for solving a family of linear equations.

Fig. 4 The data flow on the associated array of Fig. 3 for solving A * X = B.

Fig. 6 A layout design for the data-loading buses (where dotted lines represent the diffusion links).


Table 1: Input Sequence for LU Decomposition on the CDLAP

Table 2: Output Sequence for LU Decomposition on the CDLAP

Table 3: Summary of Structural Complexity and Computation Time

* For the design in [5]:

                                         no. of PEs     no. of I/O     start-up   net computation   turnaround
                                         required       channels       time       time              time
    (1) LU decomposition                 (illegible)    4n-2           n-1        2n-1              3n-2
    (2) Triangularization of LSEs        (illegible)    4n             n          2n                3n
    (3) Triangular linear system solver  n              n+2            n          2n-1              3n-1
    (4) Inversion of U                   (n^2+n)/2      2n             1          2n-1              2n
    (5) Inversion of L                   (n^2-n)/2      2n-2           1          2n-3              2n-2
    (6) Matrix multiplication            (illegible)    4n-1           2n-1       2n-1              4n-2
        (to obtain A^-1)
    (7) Solving a family of LSEs         n^2            (illegible)    n          n                 2n
        using A^-1

* The combined results from the above design:

                                                                        turnaround time
    (8)  Solving A * x = b                              [(1)+(3)]       6n-3
    (9)  Obtaining U^-1 from A                          [(1)+(4)]       5n-2
    (10) Obtaining A^-1 from A                          [(9)+(6)]       9n-4
    (11) Solving a family of LSEs using A^-1,           [(10)+(7)]      11n-4
         with A^-1 computed from A

** The algorithm based on the CDLAP in this paper:

                                         no. of PEs     no. of I/O     initial loading             turnaround time
                                                        channels                                   (after loading)
    (12) LU decomposition only           (illegible)    3n-1           n^2 data in n steps         2n-2
    (13) LU decomposition                n^2            4n+1           n^2 data in n steps         2n-1
         plus L^-1, U^-1
    (14) The above plus A^-1             n^2            4n+1           n^2 data in n steps         2n
    (15) The above plus solving          n^2+nm         4n+1+2m        n^2+nm data in n steps      2n
         A * X = B, with B of n x m
    (16) Solving A * x = b               n^2            4n+1           n^2 data in n steps         2n


VECTOR COMPUTER FOR SPARSE MATRIX OPERATIONS


Gao Qing-Shi, Wang Rong-Quan
The Institute of Computing Technology, Academia Sinica
Beijing, China


Abstract 


A vector computer architecture for sparse matrix operations is introduced in this paper. Apart from having all the functions of an ordinary vector computer, it can also efficiently operate on the non-zero elements of sparse vectors and sparse matrices in a pipelined manner. In comparison with the execution of sparse matrix operations on an ordinary vector computer, the computation speed can be increased several times or even over ten times.

On the basis of extending the standard high-level language to the vector high-level language, the further extension to the sparse vector high-level language, the basic operations on sparse vectors and sparse matrices, and their implementation in the machine are discussed.


Raising the problem 


A large number of applications, such as linear programming, the numerical solution of differential equations, structural analysis, network analysis, genetic theory, and behavioral and social science, require solving high-order sparse linear equations. With the rapid development of technology, large-scale sparse matrices will be constantly encountered in applications related to large-scale system problems.

The Sparse Matrix (SM) described in this paper refers not only to a matrix in which the non-zero elements make up a small percentage, such as less than 5%, but also to a matrix in which the non-zero elements make up a considerable ratio.


SM technology has been studied carefully in traditional computers and ordinary vector computers [2]-[4]. Utilizing SM technology, only non-zero elements are stored and calculated, which greatly reduces the needed storage space and raises the computation efficiency. But these discussions are from the angle of algorithms. Although SM technology can speed up calculation considerably, when the non-zero elements of an SM are calculated it is necessary to spend extra overhead, such as comparing, discriminating, and controlling the subscripts of non-zero elements, and this results in an ordinary pipelined vector computer being used only as a traditional scalar computer. The Vector Computer for Sparse Matrix Operation (VCSMO) uses hardware to implement the above overhead operations from the angle of architecture. Thus the non-zero elements of an SM can be effectively calculated in a pipelined manner. Therefore, the VCSMO introduced in this paper can be several times or even over ten times faster than an ordinary vector computer in the implementation of SM operations under the same technological conditions.


The Notation of Sparse Vector and Sparse Matrix

The Sparse Vector (SV) can be represented as a triad [β, γ, l], where β is a monotone-increasing integer vector, called the sparse-integer-control-vector, which is composed of the subscripts of the non-zero elements of the SV, γ is a data vector which is composed of the non-zero elements of the SV, and l is the length of β or γ.
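The triad notation can be illustrated with a small Python sketch that compresses a dense vector into (β, γ, l) and expands it again. The function names `to_triad` and `from_triad`, the 1-based subscripts, and the extra `original_length` argument are assumptions made for the example; the paper describes a hardware representation, not this code.

```python
def to_triad(dense):
    """Compress a dense vector into (beta, gamma, l) with 1-based subscripts."""
    beta = [k for k, value in enumerate(dense, start=1) if value != 0]   # increasing subscripts
    gamma = [value for value in dense if value != 0]                     # the non-zero data
    return beta, gamma, len(gamma)


def from_triad(beta, gamma, original_length):
    """Expand a triad back into a dense vector of the given original length."""
    dense = [0] * original_length
    for subscript, value in zip(beta, gamma):
        dense[subscript - 1] = value
    return dense


beta, gamma, l = to_triad([0, 5, 0, 0, 7])
```

Note that l counts only the stored non-zero elements, so the length of the original vector has to be kept separately when expanding.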

In the assembly language a vector can be represented with [D, l] and an SV with a triad [D_β, D_γ, l], where D_β and D_γ are the initial addresses of β and γ respectively, and l is the length of β or γ. It should be noted that l varies constantly in the operation. β and γ are usually stored in sequence.


The SM can be represented with a triad [B, Y, l], which is composed of a group of ordered sparse row vectors [β[i,*], γ[i,*], l[i]], i = 1, 2, ..., n. Generally speaking, if i ≠ j then l[i] ≠ l[j].

In the assembly language, the sparse row vector is represented with [D_β[i], D_γ[i], l[i]], where i = 1, 2, ..., n, and D_β[i] and D_γ[i] are the initial addresses of the row vectors β[i,*] and γ[i,*] respectively. Therefore in the assembly language the SM is represented as [D_B, D_Y, l], where D_B = [D_β[1], D_β[2], ..., D_β[n]] and D_Y = [D_γ[1], D_γ[2], ..., D_γ[n]].

Generally speaking, B and Y are continuously stored row by row and in sequence in the main memory. Therefore, the SM can also be represented as [D_B, D_Y, l] in assembly language, where D_B and D_Y are the initial addresses of B and Y respectively.
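Row-by-row contiguous storage of this kind can be sketched in Python as follows. The function name `pack_rows` and the use of a single list of row start offsets (standing in for the initial addresses of each row in both the subscript store and the data store) are assumptions for illustration; subscripts are 1-based column numbers.

```python
def pack_rows(dense):
    """Pack a dense matrix into contiguous subscript and data stores, row by row."""
    subscripts, data, row_starts, row_lengths = [], [], [], []
    for row in dense:
        row_starts.append(len(subscripts))       # offset where this row begins in both stores
        columns = [j for j, value in enumerate(row, start=1) if value != 0]
        subscripts.extend(columns)
        data.extend(value for value in row if value != 0)
        row_lengths.append(len(columns))         # l[i], the number of stored elements of row i
    return subscripts, data, row_starts, row_lengths


B_store, Y_store, starts, lengths = pack_rows([[0, 3, 0],
                                               [0, 0, 0],
                                               [4, 0, 5]])
```

Because the subscript and data stores always grow together, one offset per row locates the row in both; an all-zero row simply has length 0.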


A non-sparse vector can use a flag bit instead of β.

SV and SM can also be represented with the quadruples [β, γ, l, L] and [B, Y, l, L], where L is the upper bound of the length l; for an SM, L = (L[1], L[2], ..., L[n]) and L[i] is the upper bound of l[i]. The meaning of the other parameters is the same as in the triad notation. In some cases the calculation can be speeded up when the quadruple notation is used.

The standardized representation of a zero vector or zero matrix is blanks.
The Extension of Standard Vector High-Level Language to SV and SM

The High-Level Language (HLL) system for SM operations must be extended on the basis of the standard HLL. First it must be extended to the standard Vector HLL (VHLL) [1], and then further to the language which includes sparse vector operations and sparse matrix operations.


The Extension of the Declaration Part 


On the basis of extending HLL to standard VHLL, we introduce SPARSE VECTOR and SPARSE ARRAY, the type of which can be REAL, INTEGER, or BOOLEAN. The mode is as follows:

    {REAL | INTEGER | BOOLEAN} SPARSE ARRAY [<identifier>, <identifier>][a1:b1, a2:b2; N]
    {REAL | INTEGER | BOOLEAN} SPARSE VECTOR [<identifier>, <identifier>][a:b; N]

where "{ }" denotes "or". The first identifier is the name of the sparse control matrix or sparse control vector, of which the corresponding type is integer. The second identifier is the name of the compressed matrix or compressed vector. N is the upper bound of the needed storage space, (b1-a1+1) and (b2-a2+1) are the number of rows and number of columns of the original matrix respectively, and b-a+1 is the length of the original vector. Several sparse arrays or sparse vectors are permitted to be declared all at once.

What requires our attention is that the column SV of an SM is stored very irregularly and cannot be noted directly with [β[*,j], γ[*,j], l[j]].

The Sequence of Operations 


The priority of +, -, *, /, ¬, ∧, ∨, comparison, and other operations is the same as that in the standard HLL.

Sign, Abs, Max, Min, |Max| (absolute maximum), |Min| (absolute minimum), #B (standardization), ⌊ ⌋ (lower integer), ⌈ ⌉ (upper integer), { } (decimal part), #Z (extract main-diagonal vector of matrix), #MI (matrix inversion), #IV (inverted sequence vector), #E (extract an element), the iterative addition of a vector, #RT (return operation), etc. have the highest priority.


#IV and #MT(matrix transpose) and { belong to 
the same priority. ¥(inner product of vector),SM 
multiply SV and *,/ belong to the same priority. 
#HE( Sign replacing), #H+(Sign-bit addition) and 
+,- belong to the same priority. +(NOR) webs 
belong to the same priority, A(NAND) and A (AND 
belong to the same priority. For the detail of 
the above operations, see next section. 


The Sparse Vector Expression and Sparse Matrix 
Expression 
If the operation result of an expression is in the
representation form of SV or SM, the expression
is respectively called an SV expression or an SM
expression. An SV expression or SM expression
can be connected by operations with the
corresponding non-sparse expression. The expression
formed after the connection is called a sparse
expression or a non-sparse expression depending on
whether its operation results are sparse or
non-sparse vectors (matrices).


Assignment 

An SV and an SM can be assigned an SV expression
and an SM expression, respectively. Besides, a non-SV
and a non-SM can also be assigned an SV expression
and an SM expression, respectively; this is the
return operation. An SV and an SM can also be
assigned a non-SV expression or a non-SM expression,
respectively; this is standardization. A non-SV
and a non-SM can be assigned a constant,
but in general an SV and an SM cannot be assigned
a constant.


Standard Functions


In addition to the normal standard functions of a
general HLL, the extended standard functions
include the length of the original vector, the number
of columns and rows of the original matrix, the length
of the compressed vector, the vector consisting of the
lengths of all compressed row-vectors of a matrix, and
the maximal storage space needed for an SM or SV.


Basic Operations of HLL 


The SV computation system should include
non-SV operations, scalar operations, SV operations
and the mixed operations of non-SV, SV
and scalar. All the operations can be made under
the control of the operation control vector
[1]. The discussion here is concentrated on the
basic operations of SV and SM.


Standardization 

$[\beta_2, Y_2, l_2] := \#B([\beta_1, Y_1, l_1])$,
where $l_2$ is the number of non-zero elements of
$Y_1$, $Y_2$ is the compressed vector obtained by
eliminating the zero elements of $Y_1$, and $\beta_2$
is the corresponding subscript vector.
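
As an illustration of standardization, the following minimal Python sketch compresses an ordinary vector into a subscript vector, a compressed vector and a count of non-zero elements. The function name `standardize`, the `lower` parameter and the 1-based subscripts are assumptions made for this example; they are not notation from the paper.

```python
def standardize(dense, lower=1):
    """Return (subscripts, values, count) for the non-zero elements of dense.

    `lower` is the subscript of the first element (assumed to be 1 here).
    """
    subscripts = []
    values = []
    for offset, x in enumerate(dense):
        if x != 0:
            subscripts.append(lower + offset)
            values.append(x)
    return subscripts, values, len(values)


beta, y, count = standardize([0, 5, 0, 0, 7, 0])
print(beta, y, count)
```

Only the non-zero elements and their positions are kept, so the storage needed grows with the number of non-zero elements rather than with the length of the original vector.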


The Operations of Addition, Subtraction, Multi- 
plication, Division, AND, OR, Exclusive-OR, NAND 
and NOR . 


$[\beta_3, Y_3, l_3] := [\beta_1, Y_1, l_1] \;\theta\; [\beta_2, Y_2, l_2]$


where the operator θ is +, -, *, /, AND, OR,
Exclusive-OR, NAND or NOR. When θ is "/", the second
operand has to be a non-zero scalar or a vector
without any zero element. The result of NAND or
NOR is usually a non-SV.
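
The element-wise operations above can be pictured as a merge of two sorted subscript vectors, where each step compares two integer subscripts. The sketch below is one possible software rendering under stated assumptions: sparse vectors are `(subscripts, values)` pairs with ascending subscripts, the name `sv_combine` is hypothetical, and the operator must map two zeros to zero (true for +, -, OR and Exclusive-OR, but not for NAND or NOR, and division is not handled).

```python
import operator


def sv_combine(a, b, op):
    """Combine two sparse vectors element-wise by merging their subscripts.

    a and b are (subscripts, values) pairs with ascending subscripts.
    op must satisfy op(0, 0) == 0 so that untouched positions stay zero.
    """
    (ia, va), (ib, vb) = a, b
    i = j = 0
    out_idx, out_val = [], []
    while i < len(ia) or j < len(ib):
        if j == len(ib) or (i < len(ia) and ia[i] < ib[j]):
            idx, u = ia[i], op(va[i], 0)      # element only in the first operand
            i += 1
        elif i == len(ia) or ib[j] < ia[i]:
            idx, u = ib[j], op(0, vb[j])      # element only in the second operand
            j += 1
        else:
            idx, u = ia[i], op(va[i], vb[j])  # subscripts match
            i += 1
            j += 1
        if u != 0:                            # keep the result compressed
            out_idx.append(idx)
            out_val.append(u)
    return out_idx, out_val, len(out_val)


x = ([1, 4, 6], [2, 3, -5])
y = ([4, 5, 6], [1, 7, 5])
print(sv_combine(x, y, operator.add))
```

Results that happen to be zero (position 6 in the example) are dropped, so the output is again a compressed vector.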


Comparison

The result of an SV comparison operation (>, ≥,
=, ≤, <) is a sparse bit-vector (zero-elements
must be considered). The result of an
all-satisfying comparison operation of SV is a
bit-scalar.


Inverted Sequence Vector (#IV)


Assume $X = (x_1, x_2, \ldots, x_n)$; then
$\#IV(X) = (x_n, x_{n-1}, \ldots, x_1)$ is the inverted
sequence vector of X.

$[\beta_2, Y_2, l_2] := \#IV([\beta_1, Y_1, l_1])$,
where $\beta_2 = L + 1 - \#IV(\beta_1)$, $Y_2 = \#IV(Y_1)$,
and L is the length of the original vector
$[\beta_1, Y_1, l_1]$.
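
A short Python sketch of the inverted sequence operation on a compressed vector may help. It assumes 1-based subscripts running from 1 to the original length; the name `sv_invert` is hypothetical.

```python
def sv_invert(subscripts, values, length):
    """Reverse a sparse vector of the given original length.

    Subscripts are assumed 1-based and ascending; the result is ascending too.
    """
    new_subscripts = [length + 1 - s for s in reversed(subscripts)]
    new_values = list(reversed(values))
    return new_subscripts, new_values


# Original vector (3, 4, 0, 0, 0, 9) reversed is (9, 0, 0, 0, 4, 3).
print(sv_invert([1, 2, 6], [3, 4, 9], 6))
```

Only the stored elements are touched; the zero elements never need to be materialized.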


Other Operations 


The result of the vector inner product is a
scalar. The results of iterative addition of a
vector (Σ), NOT (¬) and the return operation (#RT)
are usually non-SV. In Max and Min operations
zero-elements must be considered. In addition,
there are #HF, #H+, the sign function (Sign), |Max|,
|Min|, Abs, ⌊ ⌋, ⌈ ⌉, { }, extract subvector, extract
an element of SV, and transmission.


Most of the above operations can be extended
to SM. Besides, the other basic operations of SM
also include #MI, #MT, #Z, extract the submatrix
of SM, multiplication of SM and SV or of SM and
vector, k-th power of SM and so on.


Implementation of Basic 
Operation of SV and SM 


System Hardware Structure


The system hardware consists of a very large
memory, a memory controller, an instruction control
unit and a sparse vector ALU. The block diagram
is shown in Fig. 1.


[Figure 1: block diagram. Labelled blocks include the pipeline unit, the
integer comparison unit, the lookahead buffers, the microcontrol unit, the
instruction control unit, the memory controller and the main memory.]

Fig. 1. Block diagram of system structure


According to the characteristics of SV and SM, the
vertical processing pipeline should be adopted.
Therefore, one of the keys of hardware
implementation is to provide high-speed access to the
memory to ensure the data access rate needed for the
integer comparison unit and pipeline unit. To
solve this problem we can use a cache or multi-bus,
multi-bank interleaved access technology.


Vector Processing of the VCSMO


The descriptor of an SV is used to provide the
parameters of the SV. The descriptor is stored in
several successive locations of memory. It is
placed into index registers before processing.


There are two Vector Parameter Files, VPF1 and
VPF2, which are especially used for processing
vector operations. When one VPF is being used
for processing the current vector instruction,
the other VPF can be used for preparing the
parameters of the next vector instruction. This
structure can minimize the effects of set-up time
on computation speed in the vertical processing
pipeline.

An Example of Implementation of the Basic
Operation in Machine


The basic operations mentioned in the previous
section can be implemented directly with
hardware. For example, Fig. 2 presents the block
diagram of the machine implementation for the
operation
$[\beta_3, Y_3, l_3] := [\beta_1, Y_1, l_1] \;\theta\; [\beta_2, Y_2, l_2]$,
where θ may be the +, -, OR or Exclusive-OR operation.


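[Figure 2: flowchart of the merge controlled by subscript comparison, with
i, j and k indexing the first operand, the second operand and the result.
Recoverable branches:
  first-operand subscript smaller (or second operand exhausted):
      y3[k] := y1[i]; β3[k] := β1[i]; k := k+1; i := i+1
  subscripts equal:
      u := y1[i] θ y2[j]; j := j+1;
      if u ≠ 0 then begin y3[k] := u; β3[k] := β1[i]; k := k+1 end;
      i := i+1
  second-operand subscript smaller (or first operand exhausted):
      y3[k] := θ y2[j]; β3[k] := β2[j]; k := k+1; j := j+1]

Fig. 2.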


Conclusion 


The hardware cost of the integer comparison
unit of a VCSMO is very small in comparison
with the pipeline unit. Nevertheless, this unit
makes it possible for the non-zero elements in the
SV and SM to be operated on in a high-speed pipelined
manner. Therefore, in comparison with the ordinary
vector computer, the VCSMO increases the
speed of SV operations and SM operations by
several to over ten times. The speed-up
is closely related to the distribution and ratio
of non-zero elements of the SM of a specific
problem.


EFFICIENT MATRIX MULTIPLICATIONS ON A CONCURRENT DATA-LOADING

Abstract -- This paper introduces a VLSI-compatible
architecture called the concurrent data-loading
array processor (CDLAP). Many matrix
operations on systolic arrays have the matrix data
reshaping problem. The CDLAP can execute the
multiplication for dense matrices without data
reshaping. A partitioned multiplication algorithm
is also presented for matrices larger than the
array size. Based on the design in this paper,
the utilization of processing elements is virtually
the best achievable for large matrices. The
CDLAP, with small variations, can be used for band
matrix multiplications. The performance, taking
into account the total computation time and data
transfer bandwidth, is found to be better than that
of systolic arrays.


I. Introduction 

In recent years, VLSI architectures for highly
parallel computations have been extensively
investigated [1, 2]. The systolic array structure
for matrix operations was first proposed by Kung
[3-5]. Various versions of systolic arrays
designed for different applications have been
proposed. The computational structure for solving a
linear system of equations (LSE) presented by
Hwang and Cheng [6] involves some special modules
for LU decomposition, solving a triangularized
LSE, triangularized matrix inversion, and matrix
multiplication. More recently, the wavefront
concept and the wavefront array processor (WAP) were
proposed [7]. Although the asynchrony and local
memory of the WAP offer more flexibility in
computation, the data flow of the WAP is basically the
same as that of systolic arrays.


All the above systolic-type array processors
for matrix operations suffer from the common
problem of data arrangement. For dense matrix
operations, the conventionally arranged matrix must be
reshaped before being fed into the processor
array. By providing on-chip delay latches, the data
arrangement problem can be solved partially, but
this evokes a more severe problem in the array
chip interconnection. The on-chip delay scheme is
suitable only for the case of a one-chip array, in
which case the array size is too restricted.
Secondly, the chip I/O circuit for a systolic
array is complex in order to achieve an acceptably
easy interconnection of array chips [8]. In
general, the utilization of PEs in the above-mentioned
arrays is half or less for matrix
operations. This leaves some room for further
improvement.


The use of global data communication, together
with the systolic scheme, in a linear convolution
array has been presented by Kung [1]. Huang
and Abraham [9] proposed some algorithms for
matrix multiplications. For the case of dense
matrices, one of their algorithms makes use of a
one-dimensional data broadcast scheme and has an
improved performance. However, it still has the
matrix reshaping requirement and the utilization
of PEs is only 50%.


This paper presents an array architecture
which removes the above-mentioned deficiencies.
In this architecture, a data broadcast scheme in
two directions across the array is introduced.
From the functional point of view, we categorize
such a computational array as a concurrent
data-loading array processor, or CDLAP for short.


II. Dense Matrix Multiplications


The CDLAP for a dense matrix multiplication
is configured as shown in Fig. 1. It is an array
of n x n processing elements (PEs). The data,
upon arrival at the data bus of each column or
each row, can be concurrently loaded in or
broadcast to all PEs in that column or row,
respectively. The functional block diagram of the PE is
shown in Fig. 2. Let the accumulator value of the
PE at the ith row and jth column be denoted by
$d_{ij}$; $D = [d_{ij}]$ will represent the accumulator
matrix of the processor array. For the initialization of
a computation, all accumulators can be set to zero
by a common reset line. When the multiplexer M
selects $u_{in}$ as the input of $d_{ij}$, the accumulator
can acquire data from its right-neighbor PE. In
another instance when the multiplexer M selects
the output of the adder, the product of $x_i$ and $y_j$
will be accumulated for each step of computation,
i.e., $d_{ij} \leftarrow d_{ij} + x_i y_j$. There are two
schemes to output the accumulator data. The first is to
connect the $d_{ij}$ output directly to the neighboring
PE on the left side; this output channel is
denoted by $u_{out}$. The second is to transfer $d_{ij}$ to
a buffer register $t_{ij}$ which will later send out
the data via the horizontal bus at an appropriate
time. Under this scheme, the horizontal bus is
used alternately for input and output. The
tri-state output of $t_{ij}$ is usually open-circuited
unless the "output enable" is on. For the
t-registers in the same row, the array can enable
the output of only one at a time. To keep control
simple, the array is designed such that all PEs of
the same column will output their data at the same
time.


In order to illustrate a dense matrix
multiplication, consider $A = [a_{ij}]$ and $B = [b_{ij}]$, both
N x N matrices. The resultant N x N matrix is
denoted by $C = [c_{ij}] = A * B$.


Case 1  Assume that N = n. Let $A_i$ be the
ith column vector of A and $B_j$ be the jth row
vector of B. Then we may write

$$C = \sum_{k=1}^{n} A_k * B_k$$

This matrix multiplication can be carried out in n
recursions, executing

$$D^{(k)} = D^{(k-1)} + A_k * B_k \qquad (1)$$

recursively for k = 1, 2, ..., n, where $D^{(0)} = 0$
and $D^{(n)} = C$. The input wave arrangement is shown
in Fig. 1. Note that the input operands $A_k$ and $B_k$
are loaded concurrently so that the (i,j)th PE has
the input operands $a_{ik}$ and $b_{kj}$ at the beginning of
the kth recursion. In executing the kth recursion,
$n^2$ multiplications and then $n^2$ additions are
performed at the $n^2$ PEs, respectively, in parallel
with loading $A_{k+1}$ and $B_{k+1}$ on the bus lines for
the (k+1)th recursion. At the end of this computation,
when k = n, the value of matrix C will
appear in the accumulator array D.
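
Recursion (1) can be simulated sequentially to check the data flow. The following Python sketch is an illustration only: the function name `cdlap_multiply` is hypothetical, and the two inner loops stand in for what the n x n PEs would do simultaneously in one step.

```python
def cdlap_multiply(A, B):
    """Simulate D(k) = D(k-1) + A_k * B_k for k = 1..n on n x n matrices.

    A_k is the kth column of A (one element per row bus) and B_k is the
    kth row of B (one element per column bus).
    """
    n = len(A)
    D = [[0] * n for _ in range(n)]           # common reset of all accumulators
    for k in range(n):
        col = [A[i][k] for i in range(n)]     # a_ik reaches every PE in row i
        row = B[k]                            # b_kj reaches every PE in column j
        for i in range(n):                    # all PEs work in parallel in hardware
            for j in range(n):
                D[i][j] += col[i] * row[j]
    return D


print(cdlap_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each step consumes one column of A and one row of B exactly as they are stored, which is why no reshaping of the operands is needed.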


The data arrangement in this algorithm is
concordant with the conventional way of storing
matrices and thus eliminates the matrix reshaping
problem. The time required to complete this
matrix multiplication, with the result staying in
the array, is $n(t_a + t_m)$, where $t_a$ and $t_m$ represent
the time to perform one addition and one
multiplication, respectively. The resultant matrix C can
be transferred out of the array processor later,
which will take some extra time. Or, it may stay
in the array for further processing, which saves
both the time of unloading C and the time of data
loading for the next stage of a computation. The
detailed performance evaluation is presented in
Section VI.


III. Partitioned Matrix Multiplications 


In practice, many matrices to be computed are 
larger than the available array size, and thus the 
computation extendability is important. 


Case 2  Assume that the matrix size N is
larger than the array size n, and that m = N/n is
an integer. The matrix C = A * B then can be
expressed in the following form:

$$c_{ij} = \sum_{k=1}^{mn} a_{ik} * b_{kj} \qquad (2)$$

Let $i = (I-1)n + i'$ and $j = (J-1)n + j'$ for
$1 \le i', j' \le n$. The matrix element $c_{ij}$ can be
expressed as $c_{Ii',Jj'}$. Then equation (2) can be
rewritten as

$$c_{Ii',Jj'} = \sum_{K=1}^{m} \sum_{p=1}^{n} a_{Ii',Kp} * b_{Kp,Jj'} \qquad (3)$$

When expressed in the submatrix form, we have

$$C = [C_{IJ}] \quad \text{and} \quad C_{IJ} = \sum_{K=1}^{m} A_{IK} * B_{KJ} \qquad (4)$$

where $A_{IK}$, $B_{KJ}$ and $C_{IJ}$ are n x n submatrices
of A, B and C at the positions (I, K), (K, J), and
(I, J), respectively.


The submatrix multiplication $A_{IK} * B_{KJ}$ in (4)
can be carried out on the n x n CDLAP as shown
in Section II. Since the resultant matrix can be
accumulated in the array, it is straightforward to
compute $C_{IJ}$ on the CDLAP. The recursions
$D \leftarrow D + A_{IK} * B_{KJ}$ are carried out and
accumulated until K = m. The resultant matrix D is then
transferred to $C_{IJ}$. The processor will then reset the
accumulator matrix D and continue its computation for
the next I, J.


The algorithm for our partitioned matrix mul- 
tiplication can be easily expressed in a high- 
level language as shown below: 


      DO 40 I = 1,m
      DO 40 J = 1,m
         reset D
         DO 20 K = 1,m
   20    D = D + A(I,K) * B(K,J)
         C(I,J) = D
   40 continue
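
For readers who want to run the partitioned algorithm, here is a minimal Python sketch of the same loop structure. It is a sequential simulation, not the hardware behaviour: the name `partitioned_multiply` is hypothetical, indices are 0-based, and N is assumed to be an exact multiple of n.

```python
def partitioned_multiply(A, B, n):
    """Multiply N x N matrices using n x n blocks, accumulating each C_IJ in D."""
    N = len(A)
    assert N % n == 0, "N must be a multiple of the array size n"
    m = N // n
    C = [[0] * N for _ in range(N)]
    for I in range(m):
        for J in range(m):
            D = [[0] * n for _ in range(n)]          # reset D
            for K in range(m):                       # D = D + A_IK * B_KJ
                for k in range(n):                   # one recursion per step
                    for i in range(n):
                        a = A[I * n + i][K * n + k]
                        for j in range(n):
                            D[i][j] += a * B[K * n + k][J * n + j]
            for i in range(n):                       # C_IJ = D
                C[I * n + i][J * n:(J + 1) * n] = D[i]
    return C


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
print(partitioned_multiply(A, B, 2))
```

The accumulator block D is reset once per (I, J) pair and read out once, mirroring the reset and output steps of the loop above.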


With the provision of output buffer registers, the
array processor only initiates the action $C_{IJ} = D$;
it need not wait for its completion. The actual
transferring of output data will take place in
parallel with the next stage of a computation in
the array.

To avoid extra time delay caused by the data
output, the output data transfer must be finished
before the arrival of the next output data; that
is, before the completion of the computation stage
concurrently running on the array. In our case,
an n x n submatrix multiplication is performed in
n computational steps and there are $n^2$ data to be
output. Considering the design simplicity and the
acceptable data transfer bandwidth, n words of
data will be output for each computational step in
the CDLAP until all output data in the buffers are
delivered. Neglecting the short time for array
resetting and output initialization, the total
time required for the case N = mn would be $(m^3+1)n$
units of $(t_a + t_m)$. The maximum data I/O rate is 3n
words per unit of computation time.
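
The time estimate for the partitioned case is easy to evaluate numerically. The sketch below assumes, as an interpretation, that each PE is busy for m^3 * n of the (m^3 + 1) * n time units; the function names are hypothetical.

```python
def total_steps(m, n):
    """Total time for N = m*n, in units of one multiply-add step."""
    return (m ** 3 + 1) * n


def pe_utilization(m):
    """Fraction of steps in which a PE computes (assumed m^3*n busy steps)."""
    return m ** 3 / (m ** 3 + 1)


print(total_steps(5, 4))
print(round(100 * pe_utilization(5), 1))
```

The single extra block of n steps is the final output transfer, so its relative cost shrinks quickly as m grows.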


IV. Band Matrix Multiplications 


The CDLAP architecture can also be applied to 
the case of band matrix multiplications. 


Case 3 (Band matrix/dense matrix multiplication):
Assume that A is an N x N band matrix of
width W = n, where N is larger than n. The matrix
C = A * B is to be computed, where B and C are
both N x W matrices. In Fig. 3(a), an example of
this multiplication with W = n = 4 is shown. The
CDLAP configuration and its associated data
arrangement for this multiplication are shown in
Fig. 3(b). This computation algorithm using the CDLAP
is similar to that described in [9] except that a
two-dimensional array is used in our case. The
function of a basic cell is shown in Fig. 3(c),
where $u_{out} = u_{in} + x \cdot y$. This function can be
implemented by the PE shown in Fig. 2.


Since the accumulators of the CDLAP can be
cleared by giving a "reset" signal before a
computation, the actual time required to complete the
operation (including the data output) lies between
N and (N + W) units of $(t_a + t_m)$. The array data
I/O rate for this case is 2W + W = 3n words per
time unit.


Case 4  The band matrix/band matrix
multiplication on the CDLAP is illustrated by the
example shown in Fig. 4(a), where the widths of the
matrices are W1 = 3 and W2 = 4. Note that in this
case the data shift must be in the diagonal
direction as shown in Fig. 4(b), while the function of a
basic cell is still the same as in Case 3. The
time required to complete this operation is
between N and (N + min{W1, W2}) time units, and
the data I/O rate is W1 + W2 + (W1 + W2 - 1) =
2(W1 + W2) - 1 words per time unit.


V. Implementation Considerations 

The CDLAP can be implemented by cross-connecting
microprocessors using some bus lines.
In order to achieve mass production and low cost, 
however, the CDLAP should be implemented in VLSI 
chips. Since the systolic array architecture has 
been considered suitable for VLSI, the CDLAP is 
compared with systolic arrays in terms of their 
VLSI implementation characteristics. 


Like the systolic array, the CDLAP is regular
and homogeneous. The basic cell (or PE) of a
two-dimensional systolic array for matrix
multiplications has six I/O channels while that of the
CDLAP needs at most four channels, as shown in
Fig. 2. For an n x n array implemented in one VLSI
chip, the systolic array has a total of 8n-2 I/O
channels from its border PEs [8]. For a dense
matrix multiplication on the CDLAP, only 2n
channels are needed for data input and n channels for
data output. In Case 3, 3n input and n output
channels will be required. Only in Case 4 does the
CDLAP need 6n-2 I/O channels, which is still 2n
fewer than required in systolic arrays.
Therefore, we expect that the I/O circuits for the
CDLAP chip will be less complex than those for
systolic arrays. Taking similar I/O schemes
proposed for systolic arrays [8], fewer
I/O pins (bit-serial I/O scheme) or less time
delay (byte-serial grouped I/O scheme) is expected
for the CDLAP.


For the mask layout, it appears that systolic
arrays have very short interconnections between
neighboring PEs. However, a detailed study shows
that data lines are still needed to link the I/O
ports on the opposite sides of each PE in systolic
arrays. These data lines may be laid across the
PE or along the border of the PE, depending on the
design. Whatever the route is, we can at least
implement the data-loading lines for the CDLAP
along the data shifting route used in systolic
arrays without occupying extra layout area. It is
probable that a simple, straight route along the
edge of a column or a row of PEs is a better
choice.


The most controversial point about the CDLAP
may be the time delay due to the data-loading
lines. However, our estimation [10] shows that
this time delay is much less than the unit
computation time $(t_a + t_m)$. Therefore, the data broadcast
scheme of the CDLAP should not be a speed
bottleneck.


VI. Performance Comparisons


The advantage of the above algorithms can be
seen by comparing their performance with existing
algorithms on other array processors. We assume
no pipelining in all PEs for the convenience of
comparisons. We also assume that the access of
system memory modules is fast enough and
appropriately arranged. For the case of dense matrix
multiplications, we let P (number of PEs) be fixed as
$n^2$. The time efficiency can then be seen from T
(the turnaround time of the entire computation).
Since a smaller T can possibly be achieved at the
expense of heavy data communication and the
communication cost is not low, the transfer bandwidth
should be considered. Here the data transfer
bandwidth, B, is defined as the maximum number of
words which have to be transferred through the I/O
ports of the border PEs in a unit of computation
time. On the other hand, the larger the processor
array is, the larger the B value would be.
Therefore, the value of $PBT^2$ will be used to evaluate
the performance as in [9]. For an ideal case, we
expect that an n x n array can complete the n x n
dense matrix multiplication in n time units. With
a uniform rate of data input and output, the data
transfer bandwidth would be 3n. The $PBT^2$ value,
$3n^5$, in this ideal case will be used as a
reference, and we define the ratio $R = PBT^2 / 3n^5$.
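
Because P, B and T are all simple multiples of powers of n in the n x n cases, the ratio R reduces to a pure number. The following Python sketch evaluates it from the T and B coefficients listed in Table 1; the function name `r_ratio` and its coefficient-based interface are assumptions made for the example.

```python
from fractions import Fraction


def r_ratio(t_coeff, b_coeff):
    """R = P*B*T^2 / (3*n^5) with P = n^2, T = t_coeff*n and B = b_coeff*n.

    The powers of n cancel, leaving b_coeff * t_coeff^2 / 3.
    """
    return Fraction(b_coeff) * Fraction(t_coeff) ** 2 / 3


print(r_ratio(1, 2))               # CDLAP with results in array
print(r_ratio(3, 2))               # WAP with results in array
print(r_ratio(Fraction(3, 2), 2))  # CDLAP with output rate 2n
```

Exact fractions are used so that the values can be compared directly with the R column of Table 1.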


For the first three algorithms in Table 1, in
which the resultant matrix stays in the array, the
CDLAP has the best R (= 2/3) as compared with the
6 and 8/3 of the other two. Assuming that the
CDLAP uses the output data rate of n words, its R
will be 8/3. This ratio is better than those for
the systolic array and the algorithm in [9] as
shown in Table 1. In the case that the CDLAP
outputs in the same data transfer bandwidth as the
input, a further improvement (T = 3n/2, R = 3/2)
can be obtained.


For the mn x mn matrices, the reference $PBT^2$
should be $3(mn)^5$. The turnaround time T of the
CDLAP in this case is $(m^3+1)n$, and the R value is
$m + (2/m^2) + (1/m^5)$. Comparing this T value with the
actual computation time $m^3 n$ of each PE, the PE
utilization is virtually perfect for large m.
(For example, if m equals 5, the PE utilization
would be 99.2%.) The R value (approximately m)
reflects the repeated use of the same data (m times)
in the input matrices.


Table 1. Performance of Parallel Algorithms for Dense Matrix Multiplications

Algorithm                                               P
* for n x n matrices
WAP [7] (with results in array)                         n^2
Broadcast algorithm in [9] (with results in array)      n^2
CDLAP (with results in array)                           n^2
Systolic array in [1]                                   n^2
Algorithm in Section 4.4 of [9]                         n^2
CDLAP with the data output rate = n                     n^2
CDLAP with the data output rate = 2n                    n^2
** for mn x mn matrices
CDLAP (data output in parallel with computation)        n^2


The band matrix multiplication on the CDLAP is
very similar to the broadcast algorithm in [9].
Both have exactly the same input/output data
arrangement. Hence their PE utilization and the
data transfer bandwidth would be the same. Since
the CDLAP can be set to zero at the beginning of a
computation, the total time duration may be equal
to or a few time units (fewer than the minimum
width) shorter than that of the algorithm in [9]. Since
this algorithm has been proven to be better than
the systolic algorithm in [5], the CDLAP should be
better yet.


VII. Conclusions 


Several cases of matrix multiplications on
the CDLAP have been presented in this paper.
Their performance has been shown to be excellent
from the PE utilization point of view. The time
performance is either better or no worse than
those for presently known algorithms on systolic
arrays. For dense matrix multiplications, no
matrix data reshaping is required; for band
matrices, the data are arranged in a way similar to
that of systolic arrays. The simple data
arrangements on CDLAPs have made it easier to design the
extended algorithms for large scale matrices.

The preliminary feasibility study in Section
V indicates some encouraging results on the VLSI
implementation of the CDLAPs. This paper, together
with other matrix operations presented in [10],
has shown the high potential of the CDLAP as a
cost-effective VLSI architecture.


Table 1 (remaining columns)

Algorithm                                    T          B     PBT^2             R      matrix reshaping
WAP [7] (results in array)                   3n         2n    18n^5             6      yes
Broadcast algorithm in [9] (in array)        2n         2n    8n^5              8/3    yes
CDLAP (results in array)                     n          2n    2n^5              2/3    no
Systolic array in [1]                        4n         6n    96n^5             32     yes
Algorithm in Section 4.4 of [9]              2n         6n    24n^5             8      yes
CDLAP, data output rate = n                  2n         2n    8n^5              8/3    no
CDLAP, data output rate = 2n                 3n/2       2n    9n^5/2            3/2    no
CDLAP, mn x mn (output in parallel)          (m^3+1)n   3n    3(m^3+1)^2 n^5    ~ m    no


Abstract -- The architecture of the microcomputer array
PAX is characterized by end-around nearest-neighbor
interprocessor connection, data
broadcasting capability, asynchronous MIMD
operation, and global synchronization control.
A system with 128 processors was built, and has
been dedicated to scientific calculations, not
only limited to the partial differential equation
models, but also extended to the non-nearest-neighbor
type, i.e. Monte Carlo simulation with
interacting particles, linear equation solving by the
Gauss-Jordan and conjugate gradient methods, and
the fast Fourier transform algorithm. High
efficiency of up to 98% was demonstrated in these
applications on PAX with 32 processors. The
parallel execution times were measured and
summarized as a scaling law expressing the
execution time as a function of problem size and
the number of processors in the array. The
proven scaling law permits us to extrapolate the
present high performance to a super parallel
machine with more than 1000 processors.


1. INTRODUCTION 

Scientific calculations are characterized by
the universal principle "action through medium,"
stating that the physical actions come from the
immediately neighboring medium or field. The
principle assures that the interprocessor
communication is limited to the nearest neighbor
processors, if the original space is projected
directly onto the array of processors distributed
analogously to the physical space. The
multiprocessor architecture that utilizes this
inherent proximity property is the well-known
nearest neighbor mesh (NNM) connection of the
processors, which has been studied in machines
such as ILLIAC-IV [1] and ICL-DAP [2].

This connectivity, however, has been supposed
to be inefficient when the data exchange is
required between any pair of processors, such as
in the matrix calculus or in the general particle
transport problems. The pessimistic conclusion
for the matrix product or inversion [3] is based
on the assignment of each matrix element to a
processor, where the operational load for each
matrix element (in O(N), N = size of the matrix)
is by far less than that for the data move (in
O(N^2)). However, if we partition the matrix
row-wise, i.e. a single processor takes care of an
unknown variable(s), the operational load has the
same order, O(N^2), as that for the data broadcast.
In the case of particle transport that requires
global data circulation in the array, the
operational load shows even higher order
dependence on the problem size than that of the
interprocessor communication. In both cases, the
users always demand to expand the problem size
far beyond the parallelism available at
present and in the near future, so that such a
row-wise partitioning would be justified.

We have started the project of developing an
array of processors with NNM connection as well as
data broadcasting capability, in order to clarify
this question by implementing practical models
and by demonstrating the capability of the NNM
architecture on a number of typical scientific
problems. It is our standpoint that the
applicability must be proven based on concrete
applications. This is because the relative value
of the communication overhead to the net local
calculation is the crucial factor, no matter how
large the absolute value of the overhead becomes.
Also from the practical viewpoint, not only the
order dependence but also the coefficients of the
dependent terms are important.

The processor array developed was previously
called "PACS", Processor Array for Continuum
Simulations [4]-[6], and two systems with 9 and
32 processors were actually built. The present
system under testing is named "PAX-128", Processor
Array eXperiment with 128 processors. A number
of applications have been implemented so far with
successful performance; among them were partial
differential equation models in aerodynamics [4],
a potential problem associated with the Poisson equation
[4], realistic nuclear power reactor calculation
[4], [7], and Monte Carlo simulations of
non-interacting plasma particles [4] and spin systems
in fundamental physics [6]. These models are
characterized by the proximity property inherent
to the physical processes simulated. The
applications were then extended to models of the
non-proximity type, such as Monte Carlo problems
with general particle interactions, and matrix
calculations in linear equation solving.

These application problems were implemented on the
PAX-32 system consisting of 32 microcomputers.
The time for parallel computation was measured and
analyzed in terms of the problem size and number
of processors. The derived "scaling law" is the
proven base on which we proceed with the plan for a
future super parallel system with more than 1000
processors and a speed of more than one GFLOPS.


2. SYSTEM ARCHITECTURE 


2.1 System Configuration 


Our series of parallel computers has, at
present, two models, the prototype machine PAX-32
[4]-[7] and the proof machine PAX-128. Both
models employ a 2-dimensional end-around NNM
connection, as well as a data broadcasting bus line.
They are comprised of three parts: host computer
(HOST), control unit (CU), and processing unit
(PU) arrays. HOST and CU are commonly used by
both machines, as shown in Fig. 1.

The PU array executes the parallel tasks. Each
PU is essentially a single-board microcomputer.
In the PAX-32 system, 32 PUs are implemented,
constituting an 8 x 4 rectangular array, and in
the PAX-128 system, 128 PUs constitute an 8 x 16
rectangular array. The CU is a microcomputer
that controls the PU array and communicates with
the HOST computer via a parallel interface. The
CU can also broadcast data to all PUs. The
HOST is a general purpose minicomputer, Texas
Instruments 990 model 20. It is used to
compile/assemble the source program, load the
object program into the CU and PU, initiate the
parallel tasks, transfer and receive the data to
and from the CU, and output the computed results.

2.2 Processing Unit 
Each processing unit consists of the components 
shown in Fig. 2, and described below. 


Microprocessing unit (MPU). An 8-bit
microprocessor. The Motorola MC6800 (1 MHz) and
MC68B00 (2 MHz) are used in PAX-32 and PAX-128,
respectively. It is in charge of the program
execution such as 8-bit fixed-point arithmetic and
logical operations, 8-bit data transfer between
memories, and the control of the arithmetic
processing unit. Address space is 32 Kbytes.


Arithmetic processing unit (APU). A
microprogrammed attached processor that executes
fixed-point arithmetic (both in 16-bit and 32-bit
lengths), floating-point arithmetic (in 32-bit
length), and the elementary function calculations,
such as logarithms, square roots, etc. The
Advanced Micro Devices Am9511 (2 MHz) and
Am9511A-4 (4 MHz) are used in PAX-32 and PAX-128,
respectively. The data transfer to and from the
APU and the initiation of every operation in the
APU are fully controlled by the MPU. By
taking the average times over the floating-point
addition and multiplication in the Am9511, we estimate
the average floating-point execution time to be
65 microseconds, which means that 0.0154
MFLOPS (Million FLoating-point Operations Per
Second) can be performed by each PU. The PAX-32
system's peak performance is thus estimated to be
0.0154 MFLOPS x 32 = 0.5 MFLOPS. In PAX-128, it
is expected to be 0.031 MFLOPS x 128 = 4.0 MFLOPS.
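
The peak-performance figures follow from simple arithmetic, which the short Python sketch below reproduces. The function name `peak_mflops` is hypothetical; the 65-microsecond average and the 0.031 MFLOPS per-PU figure are the values quoted above.

```python
def peak_mflops(avg_op_time_us, n_units):
    """Return (per-unit MFLOPS, array MFLOPS) for a given average operation time.

    One operation per microsecond equals one MFLOPS, so the per-unit rate
    is simply the reciprocal of the time in microseconds.
    """
    per_unit = 1.0 / avg_op_time_us
    return per_unit, per_unit * n_units


per_pu, pax32 = peak_mflops(65, 32)
print(round(per_pu, 4), round(pax32, 1))   # PAX-32 estimate
print(round(0.031 * 128, 1))               # PAX-128, using the quoted per-PU rate
```

These are peak estimates that ignore communication and control overhead.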


Local memory (LM). The local memory stores local
data and programs. The MPU has access to an LM
of 22 Kbytes in PAX-32, and 28 Kbytes in PAX-128.


Communication memory (CM). The memory shared
between neighboring PUs. This memory is used for
the data transfer between the PUs.


Control register (CR). This is used to 
transfer the control word from CU to PU. CR is 
write-only for the CU and read-only for the PU. 


Status register (SR). This is used to report 
the PU state to CU. It is write-only for the PU 
and read-only for CU. The PU informs its 
status by the code defined in the control 
software. 


2.3 PU Array Control by the CU 


Before starting the parallel tasks in the PU array, 
the CU has to load the programs and data into the 
proper memories in the PU array, and it also has 
to start the MPUs. In order to control and 
terminate the parallel tasks, it is necessary for 
the CU to control and monitor the PU array without 
disturbing the task execution. Together with the 
microprocessor MC6800 and local memory (LM), the 
CU has the following registers to be used for 
control purposes. 


Status Register of all PUs (SR-ALL). This 
indicates the logical AND and OR of the contents 
of the SR registers in all the PUs. This 
information is used in detecting the consistency 
of all SRs, in order to achieve task-level 
synchronization quickly in all the PUs. 
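
One way to see why the AND and OR of all SRs suffice to detect consistency is that the bitwise AND equals the bitwise OR exactly when every register holds the same value. The Python sketch below illustrates this under the assumption of small integer status codes; the function names are illustrative and do not reproduce the actual PAX hardware logic.

```python
from functools import reduce

def sr_all(status_regs):
    """Return the bitwise AND and OR over all PU status registers."""
    and_all = reduce(lambda a, b: a & b, status_regs)
    or_all = reduce(lambda a, b: a | b, status_regs)
    return and_all, or_all

def all_pus_agree(status_regs):
    """All SRs are identical exactly when AND and OR coincide."""
    and_all, or_all = sr_all(status_regs)
    return and_all == or_all

print(all_pus_agree([0x05] * 32))            # every PU reports code 0x05
print(all_pus_agree([0x05] * 31 + [0x04]))   # one PU reports another code
```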


Unit Reset register (URES). This is used to 
reset the MPU and APU. 


Unit Halt register (UHLT). This is installed 
only in PAX-128, and used to halt the MPU. 


Unit Nonmaskable Interrupt register (UNMI). 
This is used to interrupt the MPU in the PU. 


Unit Select register (USEL). This is used to 
control the bus-switch between the CU-bus and the 
PU-bus. When a PU is selected by the USEL 
register, the CU-bus is connected to the PU-bus. 


Each function of URES, UHLT, UNMI, and USEL is 
achieved by writing to the particular register of 
the CU the code to select the row and column of 
the particular PU in the processor array. For 
example, the following selections can be made: PU 
in row 0 and column 1, all PUs in row 2, all PUs 
in columns of odd numbers, all PUs, no PUs, etc. 

By connecting the CU-bus with all PU-buses, CU 
can directly broadcast the data and the program to 
all PU memories. 


2.4 Hardware Implementation 


Special care was taken in the hardware 
implementation in order to minimize the connection 
wiring. The structure of PAX is topologically a 
torus. The conventional method, parallel 
placement of the boards with connections on the 
backplane, would result in connections that are too long. 
Our implementation of PAX-32 is a "folded 
array"; that is, the original torus geometry is 
folded, as shown in Reference [4], while the 
implementation of PAX-128 is exactly a "torus", as 
shown in Fig. 3. The CU bus and the connections 
between PUs use flat cables. These methods of 
implementation reduce wiring length to nearly 1/10 
of that in the conventional method. 


2.5 Software 


Software support for the PAX includes language 
processors and Fortran subroutines executed on the 
HOST, a parallel array monitor executed on the CU, 
and a library of control procedures executed on 
the PU. Source programs for applications are 
usually written by the use of Fortran for the 
serial part of the job executed in the HOST, and 
by the use of high-level language SPLM for the 
parallel part of the job that is executed in the 
PUs. The cross-assembler MASP is available, if 
necessary, to code the parallel part of the job 
and the control procedures. 


Parallel execution of tasks can be initiated in 
PUs and data can be transferred to or from the 
HOST by calling the PCMS subroutine in the user's 
FORTRAN source program. This subroutine sends 
several commands to the CU to control the 
execution in PUs and the data transfer between PU 
and HOST. 


The high-level language SPLM, a structured 
programming-type language, was developed for 
parallel processing in the PAX system. Domain 
declaration is necessary to reserve specified 
memories for the storage of variables, constants, 
and program. The user must declare the domain 
CONST in order to reserve memory space for the 
program and constants. The domains for the 
variables are declared as follows. 
VAR : For the local variables in LM memory. 
FRONT, BACK, LEFT, RIGHT : For the variables in 
CM memories used for inter-PU communications. 
Variables should be declared with their domains, 
types, and, for array variables, sizes. The variable 
type can be either real, integer, or byte. 


In contrast to those languages used for SIMD 
architecture, SPLM describes the function 
of a single PU: how the PU takes data from 
the CM area, how it processes the data, and how 
the data are returned to a CM. Usually the same 
program is loaded in all PUs utilizing the 
broadcast function of the CU. However, different 
programs can be sequentially loaded to PUs. The 
overall parallel computation proceeds as each PU 
executes its own program. The processing is, 
therefore, inherently asynchronous. 

Task synchronization, however, is supported by 
software control. By the procedure SYNC, the 
task can be synchronized at any user-specified 
point. The synchronization is taken to ensure 
that the transferred data is correctly updated. 
It is also taken in such algorithms as SOR, where 
uniform iteration is necessary for all PUs. 
Execution of the task resumes by exiting the SYNC 
procedure. It takes 107 and 57 micro-sec to 
synchronize all PUs of PAX-32 and PAX-128, 
respectively. 


Nonexpert users require more system software 
support than that already described. One of the 
inconveniences occurs in the determination of the 
PU boundary when a physical space is projected 
onto the PU array space. There are several 
subroutines that assist users in these situations. 
One example is a subroutine which distributes a 
multi-dimensional array variable in the HOST to 
the PU array mesh. Another is a procedure 
providing the values of the dimensioned variable 
at the nearest neighbor mesh points without regard 
to the PU boundary. 

Data can be routed in two directions in the PU 
array by the ROUT procedure. A general routing 
procedure GR exchanges data with an arbitrary PU, 
using the tags indicating the destination of the 
data. Data are routed in two directions, and 
each PU catches the data with tags circulating in 
the array. 

Another possibility for the data move among PUs 
is the data broadcasting through the CU. By 
calling procedure BRDCAST, the user can move the 
data in any PU to all other PUs. 


3. APPLICATIONS AND SCALING LAW 

Three applications are introduced here; these 
are typical non-proximity type problems. The 
first example is a Monte Carlo simulation of 
moving particles that interact with each other through 
the potential field that they create. The example 
was taken from the molecular structure simulation 
for the amorphous material. The second example 
is the well-known linear equation solving by 
typical schemes: Gauss-Jordan and conjugate 
gradient methods. The third example is the 
well-known fast Fourier transform algorithm. It is 
the final goal of the study to express the 
measured time of execution as a function of the 
problem size and the number of processors. 


3.1 Execution time and efficiency 


Let us define the efficiency of parallel 
processing by the array of P PUs as follows. 

$$\alpha = T_s / (P\,T_p)$$

where Ts = time for serial execution of the job by 
a single PU, Tp = time for parallel execution of 
the same job by P PUs, and P = the number of PUs 
constituting the array. 

Let us limit the discussion to the iterative 
processing, where one iteration is a sequence of 
(1) synchronization of all PUs, (2) data move, and 


(3) net calculation, as shown in Fig. 4. The 
process is iterated until convergence. The 
processor may fall into the idling state before 
the synchronization point, since the net 


calculation may be different from PU to PU. 

The overhead is defined here as those tasks 
that do not appear in the serial processing. 
Here it consists of the synchronization, the data 
move, and the processor idling before the 
synchronization. Then the following expressions 
are derived. 

$$T_p = \max_j T_j + \max_j T_{cj}, \qquad T_s = \sum_j T_j,$$

$$P\,T_p = \sum_j T_j + \sum_j T_{wj} + \sum_j T_{cj},$$

$$\alpha = (T_p - T_w - T_c)/T_p = \text{(net calculation time)}/\text{(total processing time)},$$

where Tj = time for the net calculation in the 
j'th PU, Tcj = time for the data move and 
synchronization with the j'th PU, Twj = idling 
time of the j'th PU, Tw = idling time of PU 
averaged over all PUs, Tc = data move time 
averaged over all PUs. 
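
As a worked example of these definitions, the Python sketch below computes the efficiency in both forms from per-PU times. It assumes that every PU waits until the slowest calculation and the slowest data move have finished, so that Twj = Tp - Tj - Tcj; the sample numbers and the function name are illustrative only.

```python
def efficiency(calc, comm):
    """calc[j] = Tj and comm[j] = Tcj for each PU j (same time unit)."""
    p = len(calc)
    tp = max(calc) + max(comm)                       # parallel time Tp
    ts = sum(calc)                                   # serial time Ts
    idle = [tp - t - c for t, c in zip(calc, comm)]  # idling time Twj
    tw = sum(idle) / p                               # average idling time
    tc = sum(comm) / p                               # average data-move time
    by_ratio = ts / (p * tp)                         # Ts / (P Tp)
    by_overhead = (tp - tw - tc) / tp                # (Tp - Tw - Tc) / Tp
    return by_ratio, by_overhead

a, b = efficiency([9.0, 10.0, 8.0, 9.5], [0.5, 0.5, 0.5, 0.5])
print(round(a, 4), round(b, 4))
```

The two forms agree, which is the content of the identity P Tp = sum of Tj + sum of Twj + sum of Tcj.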

If the algorithm for parallel processing 


differs from that for serial processing, then the 
algorithmic degradation factor [8] must be applied 


to discuss the speed-up ratio over the 
conventional computer. 

3.2 Monte Carlo Simulation of Interacting 
Particles 

3.2.1 Model 


Monte Carlo simulation models can be classified 
into three categories: 

1. Non-interaction type model, in which the 
particles move independently from each other. 
Radiation transport model represents this 
category. 

2. Nearest-neighbor type model, in which the 
particles or spins interact with those in the 
nearby space, but do not move. Spin systems or 
field models in the fundamental physics fall into 
this category. 

3. Interaction type, in which the particles 
interact with each other through the field that 
they create. This is the model of the most 
general type. The plasma particle simulation, 
molecular dynamics, and galaxy model belong to 
this category. 

The model of the third category is taken as 
an example, i.e. the simulation of the structure 
of amorphous material [9]. The physical space is 
the 3-dimensional rectangular space with the 
cyclic boundary condition, i.e. one end of the 
space is connected with the opposite end. 
Simulation goes as follows: 

1. The particles are randomly distributed in 
the space. 

2. The potential and its derivative (i.e. 
force) are calculated between all pairs of 
particles. 

3. The particles are pushed by the force 
which they receive. The moving distance of each 
particle is determined so as to minimize the 
potential energy that the particle creates with 
the nearby particles. 

4. The physical parameters of interest, such 
as total energy, are calculated. 

5. The procedures 2. to 4. are repeated 
until the overall potential energy reaches the 
minimum. 


3.2.2 Parallel Scheme 


The parallel partitioning of the simulation 
follows either one of the following two methods or 
the mixture of them. 

1. "Lagrangian scheme", where the particles 
are divided into subgroups of equal number of 
particles, each to be assigned to a PU, and the PU 
is in charge of everything that happens to the 
particles, no matter how the particles are moving 
in the physical space. 

2. "Eulerian scheme", where the physical 
space is divided into subspaces of equal size, 
each to be assigned to a PU, which handles everything 
happening in the subspace. 




Analyzing the implemented program and measuring 
the execution time, we found the Eulerian scheme 
unattractive, since the non-uniform particle 
distribution causes degradation of the 
efficiency [10]. It was found, in the Eulerian 
scheme, that, as the number of particles N becomes 
large, the calculations of potential, force and 
particle move in the busiest PU dominate the 
total execution time. The efficiency becomes 
asymptotic to 1/f as P tends to infinity, where f 
is the peaking factor of the particle number 
distribution, i.e., the ratio of the maximum over 
the average number of particles per PU. 
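
The peaking factor and the resulting efficiency bound can be illustrated with a few lines of Python. The particle counts below are hypothetical, chosen only to show the arithmetic, and the function name is not taken from the PAX software.

```python
def peaking_factor(counts):
    """f = (maximum particles per PU) / (average particles per PU)."""
    return max(counts) / (sum(counts) / len(counts))

counts = [40, 25, 30, 25]         # hypothetical particles per PU
f = peaking_factor(counts)
print(f, 1.0 / f)                 # asymptotic Eulerian efficiency is 1/f
```

With these counts the busiest PU holds one third more particles than the average, so the efficiency cannot exceed 75 % however many PUs are used.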

Discussion will be centered on the following 
Lagrangian scheme. 


Parallel Lagrangian scheme. 

1. The particles are uniformly distributed in 
the PU array. 

2. All particle data are circulated in 
the PU array. This task is carried out by 
synchronized routing of data, as follows: 

2-1. The particles ("data" is omitted 
hereafter) are routed in the RIGHT direction in 
all PUs, until each of the four PUs in the row of 
the array shares all particles located in this 
row. 

2-2. Then the particles are transferred in 
the FRONT direction in all PUs until all 
particles encounter each other in the array. 

3. The potential energy and force are 
calculated in each PU during the circulation of 
particles, which proceeds in parallel in all PUs. 

4. The new particle positions are calculated 
in each PU in parallel. 

5. The physical parameters such as the total 
potential energy, and average particle move during 
the iteration are calculated. These global sum 
quantities over all PUs can be calculated in 
parallel by the well-known cascade sum method 
[11], which requires log2 32 = 5 additions and 
nearest-neighbor data transfers over a distance of 10 PUs 
(i.e. data are routed twice in the RIGHT direction, by 
1- and 2-PU distances, and then routed three times 
in the FRONT direction, by 1-, 2-, and 4-PU distances). 

6. The procedures from 2. to 5. are repeated 
until the potential energy reaches minimum. 
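
The cascade sum used in step 5 can be sketched as recursive doubling on the torus. The Python simulation below is a sketch under stated assumptions: the array is taken as 8 rows of 4 PUs (so that RIGHT routing covers a row of four), both dimensions are powers of two, and all PUs act in lock step. The function name is illustrative.

```python
def cascade_sum(grid):
    """Global sum by recursive doubling on a torus of PUs.

    grid[r][c] is the partial value held by the PU in row r, column c.
    Returns (values after the sum, number of additions, total routing distance).
    """
    rows, cols = len(grid), len(grid[0])
    vals = [row[:] for row in grid]
    additions = distance = 0
    d = 1
    while d < cols:                      # RIGHT direction: distances 1, 2, ...
        vals = [[vals[r][c] + vals[r][(c - d) % cols] for c in range(cols)]
                for r in range(rows)]
        additions += 1
        distance += d
        d *= 2
    d = 1
    while d < rows:                      # FRONT direction: distances 1, 2, 4, ...
        vals = [[vals[r][c] + vals[(r - d) % rows][c] for c in range(cols)]
                for r in range(rows)]
        additions += 1
        distance += d
        d *= 2
    return vals, additions, distance

grid = [[r * 4 + c for c in range(4)] for r in range(8)]   # 8 rows of 4 PUs
vals, additions, distance = cascade_sum(grid)
print(vals[0][0], additions, distance)
```

For the 32-PU array this reproduces the counts in the text: 5 additions and a total routing distance of 10 PUs, after which every PU holds the global sum.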


3.2.3 Parallel Execution Time and Scaling Law 


Suppose we have N particles on the 2-dimensional 
rectangular array of P = P1 · P2 PUs. 
A few parameters are defined; n = N/P, i.e. average 
number of particles per PU, d = the 1-dimensional 
size of the physical space, d* = cut-off length of 
potential, i.e. the distance where the potential 
reaches zero, and c = d*/d. 

Parallel processing program and measured time 
are analyzed as follows. If not specified, 
execution time is measured in unit of msec 
throughout this paper. 

1. Particles are uniformly distributed among 
PUs; n particles per PU. 

2. The n particle data are transferred to the 
RIGHT neighbor (P1-1) times, and nP1 data are 
moved to the FRONT neighbor (P2-1) times, so that 
this global circulation of three coordinate data 
takes 
Tio died a (PHL) 3 8s 
3. Potential and force calculations take 
T2 = 3.8 nN. 

4. Particle move takes time proportional to 
the number of particles within the reach of 
potential function d*, i.e. 
T3 = n (5.7 + 30.7 N c). 

5. Three physical parameters which require 
the cascade sum and global data routing are 
calculated. It takes 


T4 = 3 (log. P Cts: 40.753): 0.17 CPL + P2 =2)), 


The scaling law: the total parallel execution 
time Tp is represented by 

Tp = T1 + T2 + T3 + T4, 

where Tc = T1 + T4, Tw = 0, and the efficiency by 
α = 1 - Tc/Tp. 

These are calculated for several parameter values, 
as shown in Table I. 


Table I. Measured execution time and efficiency 
in parallel Monte Carlo simulation of amorphous 
material (Lagrangian scheme) on PAX-32 array. 

    Tp (sec)          Td (sec)         α (%) 
  6.77 ( 6.87)      0.23 (0.24)     96.60 (96.48) 
  9.29 ( 9.29)      0.27 (0.28)     97.09 (97.00) 
 12.20 (12.08)      0.31 (0.32)     97.46 (97.38) 
 15.49 (15.23)      0.34 (0.35)     97.80 (97.68) 

N = Number of particles, Tp = Total execution time 
(sec), Td = Data move and idling time (sec), α = 
1 - Td/Tp = Efficiency (%). The values in ( ) 
are calculated from the scaling law. 
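
The efficiency column of Table I can be cross-checked from the other two columns. The Python sketch below recomputes 1 - Td/Tp from the measured values; the last Tp entry is read as 15.49, which is an assumption about a damaged entry in the printed table, and the function name is illustrative.

```python
# (Tp, Td) pairs in seconds, measured values from Table I
rows = [(6.77, 0.23), (9.29, 0.27), (12.20, 0.31), (15.49, 0.34)]

def efficiency_percent(tp, td):
    """Efficiency 1 - Td/Tp expressed in percent."""
    return 100.0 * (1.0 - td / tp)

for tp, td in rows:
    print(f"Tp={tp:6.2f}  Td={td:4.2f}  efficiency={efficiency_percent(tp, td):5.2f} %")
```

The recomputed values agree with the tabulated efficiencies to within rounding of the two-decimal inputs.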


Since we assume that the number of particles per PU, 
n = N/P, is a constant parameter determined by the 
memory capacity of PU, the intra-PU calculation 
has the same dependence O(N) as the inter-PU 
communication. As the number of particles N 
increases, these terms with the dependence O(N) 
become dominant in the total execution time Tp. 
This makes the efficiency asymptotic to a 
constant factor, 30.7 nc /(30.7 nc + 1.2), as P 
tends to infinity. 

Again it must be noted that the inter-PU 
communication does not dominate over the intra-PU 
calculation. 


3.3 Linear Equation Solving by Gauss-Jordan and 
Conjugate-Gradient Schemes 


3.3.1 Job Partitioning into Parallel Tasks 


Parallel processing of linear equations has been 
extensively studied and parallelism of several 
levels has so far been exploited [11], [12]. 
However, it is still important to evaluate the 
parallel schemes of such well-known algorithms 
as the Gauss-Jordan and conjugate gradient 
methods by implementing them on a machine actually 
operating. This is because, as pointed out in [3], 
not only the intra-PU operational tasks, but also 
the inter-PU data move may take a major part of the 
time, leading to very low efficiency. 

As briefly mentioned in the introduction, we 
made an approach to the matrix processing 
different from those assumed in Ref. [3]. Our 
partitioning of the linear equation problem is 
such that each PU takes care of one or several 
successive row(s) of matrix and corresponding 
elements of unknown and constant vectors, as shown 
in Fig. 5. Major reason is that the balancing is 
possible between work loads of intra-PU and inter- 
PU processings as pointed out in the introduction; 
O(N?) work load of intra-PU process vs. O(N?) load 
of broadcasting among PUs, where N = matrix size. 
This row-wise partitioning is appropriate as well 
in constructing the stiffness matrix in finite 
element analysis. If each PU takes care of the 
unknown variables associated with nodes or mesh 
points of the physical structure, the matrix 
generation itself can be made with high 
parallelism. 

For the practical use of the elimination 
schemes, pivoting is important to get 
reliable solutions. The maximum element, which becomes 
the next pivot element, is found by comparison 
among the column elements. Since each row vector 
is stored in each PU, this maximum finding is 
carried out by the cascade comparison method, in which 
the data are routed similarly to the cascade sum 
method, but a comparison is made instead of a 
summation. When a few rows are assigned to each 
PU, the cascade comparison scheme is applied after 
the intra-PU comparison. 


The inter-PU communication is made by utilizing 
the broadcasting under the control of CU, from 
local memory of any single PU to local memories of 
all other PUs. The circulation of data through 
the CM memory is not used for this linear equation 
solving, except for the norm calculation over the 
distributed data in the array. This is because 
the work load for broadcasting uniformly 
distributed data to all PUs (i.e. making all 
combinations of data in the PU array) is the same 
as that for the same work by the data circulation 
via nearest data move. Suppose we have N data in 
P PUs, i.e. N/P data for a PU. Broadcasting 
N/P data P times takes time proportional to N, 
while the data circulation via nearest data move 
takes (N/P)(P1-1) + (N/P2)(P2-1) = N(P-1)/P, where 
P = P1 · P2. In the present PAX system, a single 
data move in both broadcast and nearest move takes two 
steps: data write to and read from the CU/CM 
memories. For P equal to or greater than 32, 
both works take nearly the same time. 
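
The equality of the two work loads can be checked with exact arithmetic. The Python sketch below evaluates both expressions for the 4 x 8 array; the data count of 3200 is an arbitrary illustrative value, and the function names are not part of the PAX software.

```python
from fractions import Fraction

def circulation_load(n_data, p1, p2):
    """Data moved per PU when N data circulate on a P1 x P2 array:
    (N/P)(P1-1) along the row, then (N/P2)(P2-1) along the column."""
    p = p1 * p2
    return Fraction(n_data, p) * (p1 - 1) + Fraction(n_data, p2) * (p2 - 1)

def broadcast_load(n_data, p):
    """Broadcasting N/P data from each of the P PUs in turn."""
    return Fraction(n_data, p) * p

n_data, p1, p2 = 3200, 4, 8
p = p1 * p2
print(circulation_load(n_data, p1, p2), broadcast_load(n_data, p))
```

The circulation load equals N(P-1)/P, which differs from the broadcast load N by only the factor (P-1)/P, about 3 % at P = 32.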

It must be noted that the two methods do not give 
the same work load in the case of Gauss-Jordan 
elimination, because only the pivot row data are 
broadcast to all other rows. This is the case 
where the broadcast method is superior to the 
circulation method. It must also be noted that 
there is no need to exchange the whole row vector 
data, because only the index number that tells 
where the row is located in the matrix must be 
altered. 


3.3.2 Parallel Scheme 


Variable assignment is made such that the 
matrix and vectors are divided into P blocks, each 
having N/P rows, and the i'th block is taken care 
of by the i'th PU. For simplicity, the 
explanation of the algorithm below assumes N = P, i.e. 
one-row-to-one-PU correspondence. Figure 6 
illustrates how the scheme goes in parallel in 
PUs. 


Parallel Gauss-Jordan scheme. 

For k=1,2,...,N, the following two steps are 
sequentially executed. 

1. From the k'th PU, the k'th row of the 
matrix and the k'th element of the constant vector 
b are broadcast to all PUs. 

2. The k'th PU executes 


$$a_{kj}^{(k)} := a_{kj}^{(k-1)} / a_{kk}^{(k-1)} \quad (k < j \le N), \qquad b_k^{(k)} := b_k^{(k-1)} / a_{kk}^{(k-1)}.$$


The rest of the PUs execute the elimination on 
their own rows for $(k \le j \le N,\ 1 \le i \le N,\ i \ne k)$. 


When the partial pivoting is made, the k'th row 
is the one that has the largest absolute value among 
$a_{ik}^{(k-1)}$ $(k \le i \le N)$. 
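
The row-per-PU organization of the scheme can be sketched as a serial Python simulation. This is a minimal sketch under stated assumptions: N = P, the pivot row is normalized before it is broadcast, the other PUs subtract the usual multiple of the broadcast row, and no pivoting is performed. The function name and the test matrix are illustrative.

```python
def parallel_gauss_jordan(a, b):
    """Solve a x = b with one matrix row per (simulated) PU.

    Assumption: the pivot row is normalized before it is broadcast,
    and no pivoting is performed.
    """
    n = len(a)
    rows = [list(map(float, r)) for r in a]   # rows[i] lives in PU i
    rhs = [float(v) for v in b]
    for k in range(n):
        # PU k normalizes its own row and right-hand side element.
        pivot = rows[k][k]
        rows[k] = [v / pivot for v in rows[k]]
        rhs[k] /= pivot
        # Broadcast: every PU receives a copy of row k and b_k.
        row_k, b_k = rows[k][:], rhs[k]
        # All other PUs eliminate column k from their own row.
        for i in range(n):
            if i != k:
                factor = rows[i][k]
                rows[i] = [v - factor * w for v, w in zip(rows[i], row_k)]
                rhs[i] -= factor * b_k
    return rhs

x = parallel_gauss_jordan([[2, 1, 0], [1, 3, 1], [0, 1, 4]], [3, 5, 5])
print([round(v, 6) for v in x])
```

On the real array the inner loop over i runs concurrently in all PUs, so each step k costs one broadcast plus one row update.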


Parallel conjugate gradient scheme. 


The conjugate gradient method searches for the 
solution vector that gives the minimum value of 
the error norm. The scheme starts with the 
initialization, 

$$\vec r^{(0)} := \vec b - A\,\vec x^{(0)}, \qquad \vec p^{(0)} := \vec r^{(0)}, \qquad k = 0.$$

The following steps are repeated until either 
k>N or $|\vec r^{(k)}| < \varepsilon$, where $\varepsilon$ = error criterion. The 
matrix A is assumed to be symmetric and positive 
definite. 


) >) 
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+ 
k z= k+l. 
Parallelization of the scheme follows the same 
partitioning of the variables as the parallel 
Gauss-Jordan scheme. The vector and matrix 
calculations are executed as follows. 

1. Vector addition or subtraction is made 
independently in each PU. 

2. Scalar product of vectors is made such that 
the partial product is calculated independently in 
each PU, and then the partial products are summed up 
by the cascade sum method. 

3. The product of matrix A and vector x is 
made in such a way that 
(a). the j'th element of x, x_j, is 
broadcast from the j'th PU, and that 
(b). in the i'th PU, a_ij x_j is summed up 
into the partial sum. 
These steps (a) and (b) are repeated 
sequentially for j=1,2,...,N. 
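
Steps (a) and (b) of the matrix-vector product can be sketched as a serial Python simulation. The sketch assumes N = P with one row and one vector element per PU; the function name and sample data are illustrative.

```python
def parallel_matvec(a, x):
    """y = A x with row i of A and element x[i] stored in PU i (N = P)."""
    n = len(a)
    partial = [0.0] * n                 # partial[i] is the running sum in PU i
    for j in range(n):
        xj = x[j]                       # (a) PU j broadcasts its element x_j
        for i in range(n):              # (b) every PU i adds a_ij * x_j
            partial[i] += a[i][j] * xj  #     (done concurrently on the array)
    return partial

a = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
print(parallel_matvec(a, [1.0, 2.0, 3.0]))
```

Each of the N broadcasts is followed by one multiply-add per PU, so communication and calculation both grow in proportion to N.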


3.3.3 Execution Time and Scaling Law 


Implementation of these two schemes on PAX-32 
system was made successfully for the problem 
sizes, N=32, 64, 96, 128, and 160. For the 
conjugate gradient method, the experiment was made 
for two matrices, one giving good convergence CASE 
(A), and the other poor convergence CASE (B). 
The scaling law is the function that expresses the 
time for parallel calculation in terms of the 
problem size N, the number of processors P, and 
the number of iterations M in the conjugate 
gradient method. The computation time was 
actually measured and summarized in Table II. 
The scaling law was derived from these measured 
data with good accuracy as shown below. 


Let us define several operation times for 
elementary processings; ts = time for taking 
synchronization, tr = time for real data move 
between nearest neighbor PUs, ti = similar time 
for integer move, ta = time for data addition. 
The measured times for PAX-32 PU array are 
ts=0.107, tr=0.161, ti=0.126, ta=0.412 (msec). 

Time for cascade sum is expressed by 

Tcsum = (ts + ta) log2 P + (P1 + P2 - 2) tr. 

Time for broadcasting k data from any PU to all 
other PUs is 

Tbrd(k) = (0.160k + 0.02 k/64 + 0.261) + ts. 

Time for partial pivoting Tpiv can be obtained 
by an expression similar to that for Tcsum, except that the 
time for addition ta is replaced by the time for 
compare and substitution, i.e. ta'=0.870 (msec). 
The data moved are one real number (an element of 
the matrix) and two integers (PU and row numbers). 
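
These elementary times can be evaluated numerically. The Python sketch below uses the measured constants quoted in the text and the 4 x 8 arrangement of PAX-32. Two points are assumptions: the term 0.02 k/64 in Tbrd is taken literally as printed, and the pivot search time only replaces ta by ta' without modelling the cost of the two extra integers. Function names are illustrative.

```python
import math

TS, TR, TI, TA = 0.107, 0.161, 0.126, 0.412   # msec, measured on PAX-32
TA_CMP = 0.870                                # compare-and-substitute, msec

def t_csum(p1, p2):
    """Cascade sum time: (ts + ta) log2 P + (P1 + P2 - 2) tr."""
    return (TS + TA) * math.log2(p1 * p2) + (p1 + p2 - 2) * TR

def t_brd(k):
    """Broadcast time for k data, using the expression as printed."""
    return (0.160 * k + 0.02 * k / 64 + 0.261) + TS

def t_piv(p1, p2):
    """Pivot search time: t_csum with ta replaced by ta'.
    Assumption: only the addition time changes; the cost of moving
    the two extra integers is not modelled here."""
    return (TS + TA_CMP) * math.log2(p1 * p2) + (p1 + p2 - 2) * TR

print(round(t_csum(4, 8), 3), round(t_brd(32), 3), round(t_piv(4, 8), 3))
```

For P = 32 a cascade sum takes about 4.2 msec, which is comparable to broadcasting a row of a few dozen elements.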

The scaling law for the Gauss-Jordan scheme is 
expressed by using these measured total and 
elementary times, as follows. 


Tgj(N,P) = 0.011 (N/P+1) + 1.047 N /P + (Tpiv + 
0.495)N +2 Tbrd(k+1) + 0.157 N (N+1)/P. 


The estimated time by this scaling law can 
reproduce the measured values as shown in Table 
II. 

The scaling law can be derived similarly for 
the conjugate gradient method. The time for the 
product of matrix and vector Tmat is derived as 

Tmat(N,P) = 0.033 N/P + 2.43 + P (0.222 + 
Tbrd(N/P) + 0.094 (N/P-1) + 0.224 (N/P)). 

The total parallel computation for the conjugate 
gradient method takes 

Tcg(N,P,M) = (0.1 + 0.725 M) N/P + 0.379 M + 
(M+1) Tmat(N,P) + M Tbrd(1) + (3M + 1)(Tcsum + 
0.173) + 0.5, 

where M = number of iterations. The estimated 
times by this expression are compared in Table II 
with the measured values. 


The efficiency defined in 3.1 can be applied to 
these parallel executions as well. For the 
Gauss-Jordan scheme, the idling time Tw immediately 
before the synchronization is almost zero, 
because the difference between the 
processings for the pivot PU and a non-pivot PU is 
small, and the idling occurs in only a single 
PU, i.e. the pivot PU. The efficiency is therefore 
approximated by α_gj = (Tgj - Tc)/Tgj, where Tc 
is the time for data move, which is represented by 
Tc = N (3 ts log2 P + (P1 + P2 - 2)(tr + 2ti)) for the 
Gauss-Jordan scheme. A similar expression holds 
for the conjugate gradient scheme: α_cg = (Tcg - 
Tc)/Tcg, where Tc = P Tbrd(n) + M Tbrd(1) + (3M + 
1)(ts log2 P + tr (P1 + P2 - 2)). 

The evaluation of the efficiency is made 
assuming that n = N/P, the number of rows per PU, 
is a fixed parameter. Asymptotic behavior of the 
efficiency is such that 

$$\lim_{P \to \infty} \alpha_{gj} = n/(n + 0.510), \quad \text{for Gauss-Jordan},$$

$$\lim_{P \to \infty} \alpha_{cg} = (n^2 + 0.419\,n + 0.573)/(n^2 + 1.14\,n + 2.22), \quad \text{for conjugate-gradient}.$$

Efficiencies of both schemes become asymptotic to 
values ranging from 46 % for n=1 of the conjugate 
gradient scheme (the worst case) to 95 % for n=10 
of the Gauss-Jordan scheme, and approach 100 % as n tends to infinity. 
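
The quoted limits can be evaluated directly. In the Python sketch below the conjugate-gradient expression is read with n squared in the leading terms, which is an assumption about the printed formula; it is consistent with the 46 % quoted for n=1. Function names are illustrative.

```python
def limit_gj(n):
    """Asymptotic Gauss-Jordan efficiency for n rows per PU."""
    return n / (n + 0.510)

def limit_cg(n):
    """Asymptotic conjugate-gradient efficiency for n rows per PU."""
    return (n * n + 0.419 * n + 0.573) / (n * n + 1.14 * n + 2.22)

for n in (1, 10, 100):
    print(n, round(100 * limit_gj(n), 1), round(100 * limit_cg(n), 1))
```

Both expressions increase monotonically toward 1 for larger n, so assigning more rows to each PU raises the asymptotic efficiency.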


Again the inter-PU communication does not 
devastate the efficiency of these parallel linear 
equation schemes. 


Table II. Measured execution time and efficiency 
in linear equation solving on PAX-32 array. 

<< GAUSS-JORDAN SCHEME >> 

    N      Tp (sec) 
    32     0.585 ( 0.640) 
    64     2.558 ( 2.485) 
    96     6.636 ( 6.502) 
   128    13.853 (13.656) 
   160    25.676 (24.912) 

<< CONJUGATE-GRADIENT SCHEME >> 

    N      Tp (sec), Case (A)     Tp (sec), Case (B) 
    32     0.509 (0.473)           1.651 ( 1.586) 
    64     0.831 (0.777)           5.337 ( 5.106) 
    96     1.286 (1.223)          12.297 (11.957) 
   128     1.879 (1.812)          23.884 (23.514) 
   160     2.602 (2.545)          41.341 (41.151) 

N = Matrix size, Tp = Total execution time (sec), 
α = Efficiency (%), Ni = Number of iterations. 
Values in ( ) are calculated from the scaling 
law. 
Case (A) is a matrix example that gives good 
convergence in the conjugate-gradient scheme, 
while Case (B) is one that gives poor convergence. 


3.4 Fast Fourier Transform (FFT) 


We have also implemented a parallel FFT (Fast 
Fourier Transform) on PAX, and have estimated the 
efficiency by observing the time of execution. 
Note that an FFT algorithm requires data 
exchange between distant PUs. 

Despite the existence of specialized 
hardware for FFT, we were motivated to try the 
parallel processing of FFT on the PU array, because 
a single processor array should work in both 
phases of interacting particle transport problems 
solved by the particle-mesh method [13], where the 
potential field is solved in the Eulerian manner by using FFT, 
and the particles are pushed in the Lagrangian manner. If we 
use two systems, special FFT hardware and the 
processor for particle push, there must be wide 
bandwidth among these processors and common 
memory. If the PAX is proven fast enough in FFT, 
the whole problem can be solved on the processor 
array of PAX. 


3.4.1 Parallel FFT Algorithm 


Using the Cooley-Tukey [14] notation, the FFT 
algorithm is written as follows. Consider the 
problem of calculating the complex Fourier series, 

$$x(j) = \sum_{k=0}^{N-1} X(k)\, W^{jk}, \qquad j = 0, 1, \ldots, N-1, \qquad (1)$$

where the given Fourier coefficients X(k) are 
complex and W is the principal N'th root of unity, 
$W = \exp(2\pi i/N)$. Suppose N is a power of 2, i.e. 
$N = 2^n$. Then let the indices in (1) be expressed 
as 

$$j = j_{n-1} 2^{n-1} + j_{n-2} 2^{n-2} + \cdots + j_1 2 + j_0,$$

$$k = k_{n-1} 2^{n-1} + k_{n-2} 2^{n-2} + \cdots + k_1 2 + k_0,$$

where $j_\nu$ and $k_\nu$ are equal to 0 or 1. This 
expression gives a unique representation of j and 
k. So, this allows Eq.(1) to be written as 

$$x(j_{n-1}, \ldots, j_0) = \sum_{k_0} \sum_{k_1} \cdots \sum_{k_{n-1}} X(k_{n-1}, \ldots, k_0)\, W^{\,j k_{n-1} 2^{n-1} + j k_{n-2} 2^{n-2} + \cdots + j k_0}.$$

The innermost sum, over $k_{n-1}$, can be written as 

$$A_1(j_0, k_{n-2}, \ldots, k_0) = \sum_{k_{n-1}} X(k_{n-1}, k_{n-2}, \ldots, k_0)\, W^{\,j_0 k_{n-1} 2^{n-1}}.$$

Proceeding to the next innermost sum, over $k_{n-2}$, 
and so on, one obtains the recursive notation, 

$$A_l(j_0, \ldots, j_{l-1}, k_{n-l-1}, \ldots, k_0) = \sum_{k_{n-l}} A_{l-1}(j_0, \ldots, j_{l-2}, k_{n-l}, \ldots, k_0)\, W^{\,(j_{l-1} 2^{l-1} + \cdots + j_0)\, k_{n-l} 2^{n-l}} \qquad (2)$$

for l=1,2,...,n. Equation (2) shows that N 
independent calculations are obtained at each l'th 
step. In Fig. 7, the FFT data flow diagram is shown 
for N=8, in which the power of W in Eq. (2) 
is multiplied on →, and the sum over $k_{n-l}$ is taken on 
⊕. After n steps, the Fourier series can be 
evaluated at the rightmost side. Clearly, 
independent N processes, → and ⊕, can be 
separated by ---. 

Each PU(i,j) corresponds to a non-negative 
integer p = iP1 + j, where the number of 
PU's is P, P = P1 · P2 (P1, P2 are also powers of 2). 
Then each P(p), which means the PU(i,j), may have 
only one Fourier coefficient X(p) or $A_l(p)$, if 
N=P. Note that → includes a data move 
operation between $A_{l-1}(j_0, \ldots, j_{l-2}, 0, k_{n-l-1}, \ldots, k_0)$ 
and $A_{l-1}(j_0, \ldots, j_{l-2}, 1, k_{n-l-1}, \ldots, k_0)$, 
and this corresponds to a data routing 
between $P(j_0, \ldots, j_{l-2}, 0, k_{n-l-1}, \ldots, k_0)$ and 
$P(j_0, \ldots, j_{l-2}, 1, k_{n-l-1}, \ldots, k_0)$. 
Then the interval between these PU's is $2^{n-l}$, 
so this can be easily realized using the routing 
operation of PAX. 
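
The staged structure of the algorithm, with partners separated by a power-of-two interval at each stage, can be sketched in Python as follows. This is a serial simulation under stated assumptions: N = P with one element per PU, the stored index carries j_0 in its highest bit (so the result appears in bit-reversed order), and W = exp(2πi/N) as in Eq. (1). Function names are illustrative.

```python
import cmath

def staged_fft(coeffs):
    """Evaluate x(j) = sum_k X(k) W**(j*k), W = exp(2*pi*i/N), in n = log2 N
    stages. At stage l each element combines with its partner at distance
    2**(n-l). The result is stored in bit-reversed order of j."""
    N = len(coeffs)
    n = N.bit_length() - 1
    assert 1 << n == N, "N must be a power of 2"
    A = [complex(c) for c in coeffs]
    for l in range(1, n + 1):
        d = 1 << (n - l)                       # partner distance 2**(n-l)
        B = [0j] * N
        for m in range(N):
            # The top l bits of m hold j_0 ... j_{l-1} (j_0 is the highest bit).
            jpart = sum(((m >> (n - 1 - v)) & 1) << v for v in range(l))
            w = cmath.exp(2j * cmath.pi * jpart * d / N)
            B[m] = A[m & ~d] + w * A[m | d]    # sum over k_{n-l} = 0, 1
        A = B
    return A

def bit_reverse(m, n):
    """Reverse the n-bit binary representation of m."""
    return int(format(m, f"0{n}b")[::-1], 2)

X = [1, 2, 0, -1, 3, 0.5, -2, 1]
A = staged_fft(X)
x = [A[bit_reverse(j, 3)] for j in range(8)]    # natural order of j
```

On the PU array the fetch of the partner element at distance 2**(n-l) is the routing step, and the multiply-add is the arithmetic step performed in every PU at once.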

In the general case N>P, each PU must have N/P 
data, which are 
X(pN/P + q) or A(pN/P + q), for q = 0,1,...,P-1. 

First, calculate the pseudo Fourier series 
A_{N/P-1} for the data distributed over each PU by 
using a serial FFT algorithm. Second, A_{N/P-1} is 
considered as N/P sets of Fourier coefficients to 
be calculated; the q'th set is A_{N/P-1}(pN/P + q), 
for q = 0,1,...,P-1. 


3.4.2 Execution Time and Scaling Law 


We rewrite the parallel FFT algorithm as 
follows in steps 1 and 2. 


Parallel FFT. 

1. Calculate N/P data at each PU using the 
serial FFT algorithm, indicated by subscript 
'seq'. 

2. Execute parallel FFT processing with all 
PU's; 

arithmetic part (→ operation and ⊕ operation), 
both indicated by subscript 'op', 
routing part, indicated by subscript 'r'. 


The execution time of each part is written as 
Tseq(N/P), Top(N), Tr(N,P); then Tpara, the total 
time of parallel FFT processing, is expressed as 
Tpara = Tseq(N/P) + Top(N) + Tr(N,P). 


Table III. Measured execution time in parallel 
FFT algorithm on PAX-32 array, where P=32. 

Columns: Number of Data per PU, N/P; Parallel 
Execution Time, Tp (msec); Serial Execution Time, 
Ts (msec); Efficiency α = Ts/(Tp P) (%). 

Parallel Execution Time, Tp (msec): 
258.0, 612.6, 1358.2, 2996.6 

In Table III, the result of measurement is shown 
where P=32. Tseri, the time of serial FFT 
processing for N data with 1 PU, is the 
comparison, and it is clear Tseri = Tseq(N). 




There exists an overhead of parallelization, 
as the values for Tseri/Tpara are all less than 
P. This overhead in the parallel FFT processing 
is caused by 

1. Idle PU's as a result of unbalanced load 
calculation. 

2. Routing. 

Cause 1. is a result of measuring the execution 
time of the PU whose calculation load is the 
heaviest. Related to Cause 2. is Tr, the total 
time for routing, which is represented as Tr = ts 
log2 P + tt(N/P) 3/P, where ts and tt are the unit 
times for synchronization and data transfer. Top, 
the time of the arithmetic part, is also 
represented as Top = top(N/P) log2 P + constant, 
where top is the maximum execution time of one 
repetition. Moreover, Tseq(n), the total time of 
the serial FFT processing for n data with 1 PU, is 
Tseq(n) = t1 n log2 n + t2 n + constant. 
Define α as the efficiency of parallelization, α = 
Tseri / (Tpara · P). Assigning the measured 
parameters ts=0.14, tt=0.18, top=1.80, t1=1.82, 
and t2=1.20 (msec), we then get α and the ratio 
of routing in Tpara as we increase P, the number 
of PUs, as shown in Fig. 8. Though α tends to 0 as 
P tends to infinity, parallel FFT processing on PAX is quite 
practical within the currently predicted limit of 
realization (about 10000 PUs). 


4. PERFORMANCE EXTRAPOLATION 


It is interesting to wonder whether high 
efficiency is still maintained, even if the number 
of processors gets very large, say more than 1000. 
Proven scaling laws permit us to extend the 
present performance to the super parallel systems. 


The hardware can be sped up by a factor of 100,
which is realized by a microprocessor of 20 MIPS
and 45 nano-sec memory devices. The system
performance could then be raised accordingly by a
factor of 100, except for the synchronization time
ts, which consists of a fixed time for detection
plus a time proportional to the size of the array
P; the latter is necessary for the synchronization
signal to traverse the PU array twice. However,
the size-dependent term does not become great
within the implementation limit P<1,000,000 [6].
The signal skew [15] is not the primary limiting
factor on the maximum number of processors, either
[6].


The extrapolation of performance, therefore, is
such that, for a number of processors greater than
1,000,000, the efficiency will decrease, while,
for the realizable array size (P<10,000), the
size-dependent synchronization overhead does not
affect the performance extrapolated from that
obtained on PAX-32.


The same statement as before must be repeated 
here: the inter-PU communication does not 
devastate the practical applicability. 


5. CONCLUSIONS AND FUTURE PLAN


The pilot machine PAX-32 demonstrated that the
nearest neighbor mesh processor array operable in
MIMD mode is capable of executing a wide range of
scientific applications with efficiency high
enough for practical use. The applications
implemented on the PAX machine cover those
problems without proximity, such as particle
transport problems with general interaction,
linear equation solving with matrix and vector
calculations, and the FFT algorithm. It was proven
that, within the practical limit of the size of
the array, the efficiency is not devastated by the
data movement between processors, and almost
linear speed-up can be expected.

The aim of our PAX development project is to
demonstrate the practical usefulness of the NNM
array by physical means, i.e., by constructing the
highly parallel NNM-connected PU array and by
solving the end-user's large scale scientific
applications on it. The demonstration may only
tell that the machine can be sufficiently good in
such and such applications (what we call the
sufficient condition approach), and it does not
say anything about what is necessary for general
applicability (what we call the necessary
condition approach). Though general applicability
has not been established so far, we are going to
pursue the sufficient condition for a practical
machine, and someday we will ask, "Is such a
general-purpose machine really necessary that
could cover applications wider than those covered
by our PAX system?" In short, we hope to wipe out
the pessimism about the NNM array that has been
widespread since ILLIAC-IV.

In order to provide supercomputing power far
greater than that presently available, at low cost
for the end users, we plan to go through several
developmental steps, from the present PAX-128
system, consisting of 128 one-board microcomputers
with approximately 4 MFLOPS, to our final goal,
PAX-1M, consisting of 1 million VLSI processors
with 1 Tera FLOPS speed, which may be realized in
the early 1990's.


We are optimistic in going our way. Optimism
is sometimes dangerous, but it no doubt motivates
progress.

Fig. 2. Configuration of Processing Unit of PAX
System.

Fig. 3. Hardware Implementation of PAX-128 System.

Fig. 4. Time Chart of Typical Iterative Scheme.
[...] intervals for communication, net calculation,
and wait, respectively. Synchronization of all PUs
is taken at SYNC.

Fig. 6. PU-Time Diagram in Parallel Gauss-Jordan
Scheme for the case of Matrix size = Number of
PUs = P. Symbol P.E. stands for the pivot-row
elimination and N.P.E. for the non-pivot-row
elimination.

PARTITIONING JOB STRUCTURES FOR SW-BANYAN NETWORKS

Abstract


Large multiprocessor systems interconnected 
by multistage SW-banyan networks may suffer 
from communication blockage if resources are not 
adequately allocated to processes within jobs. 
This communication blockage can quickly lead to 
performance degradation. Given a set of job 
structures, the problem is to map these structures 
onto the network in such a way that the amount of 
communication blockage induced by the mappings 
is held to a minimum. In general, this problem is 
very hard, bearing resemblance to the graph 
isomorphism problem. It has been solved for cer- 
tain structure types. This paper describes one 
general purpose mapping method that is suitable 
for structures that can be partitioned in a partic- 
ular manner. The structures are mapped onto 
regular SW-banyans in such a way that no commu- 
nication blockage can occur. 


INTRODUCTION 


In highly parallel multiprocessor MIMD systems 
interconnected by large, complex multistage inter- 
connection networks which possess the blocking 
characteristic, assignment of resources to proc- 
esses becomes a problem of primary significance. 
It is important, whenever possible, to assign the 
resources in such a way that the amount of com- 
munication interference is held to a minimum, both 
on a global scale (among all jobs) and on a local 
scale (among the processes within a job). The 
problem is an extremely complex one, one which 
includes a great number of orthogonal variables. 
These variables include job sizes and structures, 
job mix, system organization, network topology, 
and whether resource allocation is to be made on a 
dynamic or static basis, to name just a few. 


So that the systems being considered can be 
allowed to be scaled up to larger and larger sizes, 
it is necessary that resource allocation schemes 
for such systems exhibit satisfactory performance, 
preferably linear in the number of resources and 
job size, and certainly not much worse than loga- 
rithmic. It is also extremely important for 
resource allocation schemes to exhibit as much 
flexibility as reasonably possible. The number of 
different job structures that will be presented to a 
system is usually potentially very large. It is 
impractical to develop and maintain a large col- 
lection of specialized resource schedulers, each of 
which is suited to only one or some small number 
of job structures. Instead, a small number of gen- 
eral purpose resource schedulers with great flexi- 
bility is desired. 


This paper discusses a resource scheduling
scheme for use in such a scheduler. As previously
mentioned, any resource scheduler must incorporate
specific values of many variables. The
resource scheduling scheme described here is use- 
ful for mapping process-structured computations 
onto multistage regular SW-banyan networks, both 
rectangular and non-rectangular. All processors 
are located on one side of the network. Communi- 
cation paths pass from one processor all the way 
to the other side and then back through the net- 
work to the desired other processor. All commu- 
nication paths are bidirectional. The nodes on the 
network side opposite that of the processors can 
be connected to memory modules, as in TRAC 
[Sejnowski] or the Ultracomputer [Gottlieb], or
they can simply be "bounce back points." It is 
intended that they will be memory modules, capa- 
ble of buffering communication. Communication 
can be either by packet switching or circuit 
switching - the resource allocation is the same. 
(However, as will be explained below, additional 
enhancements are possible in packet switching 
systems.) The jobs are assigned all resources at 
once but jobs may be scheduled dynamically. The 
structure of a job is arbitrary but must be strong- 
ly partitionable. This property is defined below. 
Applications of the resource scheduler to struc- 
tures which are not strongly partitionable are dis- 
cussed at the end of this paper. 


DESCRIPTIONS OF THE VARIABLE SPACE 


This section defines the variable values for
which the described resource allocation scheme is
suited. Following this section, the allocation
scheme itself is presented.


SW-banyans


SW-banyans are formally defined in [Goke].
Figure 1 illustrates an SW-banyan. In this
figure, the nodes of the banyan are arranged in a
number of distinct levels, with level 0 being at the 
top, level 1 at the level below level 0, and so on. 
The last level is denoted "level L." Nodes in 
adjacent levels are connected together through a 
set of banyan crossbars. Within each level x, the 
fanout of a node is denoted f(x); the spread of a 
node is denoted s(x). The fanout of a node is the 
number of edges exiting below the node. The 
spread of a node is the number of edges exiting 
upwards. Nodes with fanout of zero are called 
bases; nodes with spread of zero are called 
apexes. There is exactly one path between any
base and any apex; this is the "unique path"
property of banyans. The nodes in each level are
numbered from left to right, from 1 to n(x),
where n(x) is the number of nodes in the level.
SW-banyans are three dimensional structures that
can be represented in many ways in two dimensions.
Throughout this paper, all SW-banyans will
be represented in a base-distance decomposition 
manner [DeGroot]. In such a decomposition, each 
level x of the banyan can be seen to contain f(x) 
sub-banyans (component banyans) below that lev- 
el. Thus in Figure 1, f(0), the fanout of the 
nodes at level 0 (the apex nodes) is 3; and at lev- 
el 1 there are three component SW-banyans. 
Although the class of banyans includes busses, 
trees, and crossbars, we are interested here only 
in multistage SW-banyans (i.e., SW-banyans with
L > 1) that have f(x) ≥ 2 and s(x) ≥ 2 for all
levels x. This class includes such well known
networks as the Omega [Lawrie], binary n-cube 
[Pease], Flip [Batcher], Baseline [Wu], and Gen- 
eralized Cube [Siegel]. SW-banyans are called 
regular if for every level x, f(x) is the same for 
all nodes within level x, and so is s(x). If n(x) 
is the same for all levels x within an SW-banyan, 
the banyan is called rectangular. If f(x) and n(x) 
are the same for all levels x, then the banyan is 
called strongly rectangular; if all values for n(x) 
are the same but the values for f(x) differ 
between levels, the banyan is weakly rectangular. 
The resource scheduling scheme presented here is 
applicable to regular SW-banyans, and to either 
rectangular or non-rectangular ones. The scheduling
scheme is first described using only rectangular
SW-banyans. It is shown later how non-rectangular
SW-banyans may also be used.


Process-Structured Computations


All computations are assumed to be
process-structured computations. Such computations
consist of a number of disjoint processes (or
tasks), each with its own memory and processing 
element. The processes do not share memory, but 
they may communicate over logical communication 
channels [Kahn]. It is possible that these chan- 
nels may be based in memory, as is the case, for 
example, in TRAC [Sejnowski]; but this is not 
necessary. In TRAC, the memory serves to buffer 
communication. A graphical representation for a 
process-structured computation would appear sim- 
ply as a set of nodes with each node representing 
one process. If two processes are to communicate 
with each other, then there will be an edge con- 
necting their two respective graph nodes. The 
number of processes must be determinable at com- 
pile time, as must the number and distribution of 
logical communication channels [Kahn, Hoare]. 


Dynamic Job Mixes


Jobs are assumed to be able to arrive and com- 
plete at will, thus the job mix is a dynamic one. 
However, the mapping scheme presented here 
presently works best when it has its own partition 
of the banyan in which to work. If other jobs are 
already resident in the system, as would be true 
in any dynamic system, and a separate
sub-banyan cannot be found, the success of the
mapping method presented would most likely suf- 
fer. The operating system must then clearly be 
relied upon to treat sub-banyans as a schedulable 
resource. Compensating for a partially busy net- 
work is considered at the end of this paper. 


System Organization 


The systems considered consist of a large num- 
ber of processors (say 100 to 1000 or more), all of 
which are located at one side, the apex side for 
example. Nodes on the base side may be con- 
nected to memory buffers or may be "bounce 
back" points. The interconnection network is an 
SW-banyan, and therefore the network is charac- 
terized by the "unique path" property. For 
processor A to communicate with processor B, A 
originates a message which travels down the net- 
work from the apex to which A is connected to 
some base node and then back upward through the 
network to the node to which processor B is con- 
nected. Any base node can be used for this com- 
munication. Consequently, there are many 
different possible communication paths between 
processors A and B, even though the network is a 
"unique path" network. However, once a base is 
selected, there is one and only one path from it to 
each of the processors A and B. The scheduling 
scheme presented relies heavily on both of these 
facts. 


SW-banyans, like most multistage intercon- 
nection networks, are blocking networks. Thus, 
given a set of communication paths to be estab- 
lished, it may not be possible for all communi- 
cation paths to be set up in the network without 
having two or more of the communication paths 
interfere with each other. In circuit switching 
paths, this type of blockage may in fact prevent 
the needed set of communication paths from being 
set up. Packet switching communication allows 
interfering communication paths to be set up, but 
as communication commences, packets from two or 
more communications may interfere with each 
other, thereby slowing down the rate of computation
of the processes that interfere with each other.
How much interference is encountered
depends on run-time performance aspects.


To avoid this run time communication delay, it 
is desirable to allocate communication paths in 
such a way that they will never interfere with 
each other. By so doing, process computation 
speeds can be kept at a maximum. Figure 2 shows 
the effect of communication blockage in an 
SW-banyan. In Figure 2.a, the communication path 
between processors A and B shares links in the
network with itself (the downward path shares
links with the upward path), but because these 
two processors are communicating with each other 
and are aware of each other's communication activ- 
ity over this path, they do not interfere with each 
other. However, processors D and E can interfere 
with each other; if D wants to talk to C and E 
wants to talk to F simultaneously, their communi- 
cations will interfere with each other. The amount 
of delay experienced by the processors will vary 
tremendously with the application and timings of 
individual processes. This is an example of com- 
munication blockage. It is this blockage which the 
mapping method presented herein prevents. 


THE PARTITIONING METHOD 


The partitioning method is presented in this 
section. Given a computation structure S, a
strong bipartitioning of S into two substructures
S1 and S2 is found such that no two edges in the
cut set are connected to the same node. This
strong partitioning is more restrictive than simple
partitioning. The substructures S1 and S2 can
then be mapped onto SW-banyans which have
f(x) ≥ 3 for 0 ≤ x ≤ L-1. Such banyans
recursively have at least three component
SW-banyans at every node level. To map S onto
such a banyan, S1 is assigned to one of the three 
component SW-banyans at level 1, S2 is assigned 
to another, and a third is used to form the inter- 
connections. Figure 3 illustrates this process. 
Note that nodes at level 1 are used to form the 
interconnections between the processors assigned 
to S1 and S2. 
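The strong partitioning condition (no two edges in the cut set connected to the same node) can be checked mechanically. The following is a minimal sketch, not the paper's own procedure: the function names and the dictionary `side`, which maps each node to the substructure it is assigned to, are assumptions made for illustration.

```python
from collections import Counter


def cut_set(edges, side):
    """Edges whose two endpoints lie in different substructures."""
    return [(u, v) for u, v in edges if side[u] != side[v]]


def is_strong_partition(edges, side):
    """True if no node is an endpoint of more than one cut edge."""
    touched = Counter()
    for u, v in cut_set(edges, side):
        touched[u] += 1
        touched[v] += 1
    return all(count <= 1 for count in touched.values())


# Cube: corners are the 3-bit numbers 0..7; edges join corners that
# differ in exactly one bit (12 edges in total).
cube_edges = [(u, u ^ (1 << b)) for u in range(8) for b in range(3)
              if u < (u ^ (1 << b))]
squares = {u: u >> 2 for u in range(8)}  # split on the top bit

# Star: a centre joined to three leaves. Any split that separates two
# leaves from the centre puts two cut edges on the centre node.
star_edges = [(0, 1), (0, 2), (0, 3)]
star_side = {0: "A", 1: "B", 2: "B", 3: "A"}

print(is_strong_partition(cube_edges, squares))
print(is_strong_partition(star_edges, star_side))
```

Because `side` may take any number of distinct values, the same check applies unchanged to partitionings into more than two substructures.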


Once the first strong bipartitioning has been 
performed and the substructures have been 
assigned to separate level 1 component banyans, 
the mappings of S1 and S2 are recursively solved
using these component banyans. This process continues
until the substructures are no longer
decomposable, that is, when each substructure
consists of only a single processor. Nodes at lower 
and lower levels will be used to interconnect suc- 
cessive substructure decompositions. It is appar- 
ent that this mapping method requires as many 
levels of nodes as the number of strong parti- 
tionings plus one. This mapping method will 
become clear as a few examples of its application 
are presented. Following these examples, several 
ways of extending the method are presented. 


Partitioning the Cube 


One simple problem structure is the cube, con- 
sisting of eight nodes (processes) and 12 edges 
(communication channels). The goal of the 
resource scheduling scheme is to assign one 
processor (apex) to each process node and one 
base and communication path to each communi- 
cation channel. To begin with, the cube can easi- 
ly be strongly partitioned into two disjoint 
squares. Each square can then be further parti- 
tioned into two lines, and each line can be parti- 
tioned into two nodes (processors). See Figure 4. 
None of these partitionings violate the strong par- 
titioning interconnection constraints, that is, no 
two edges in any cut set connect to the same 
node. Using the strong partitioning mapping 
method, the two squares will be interconnected 
through nodes at level 1 since they are the substructure
results of the first partitioning; the
lines of the square will be interconnected through 
nodes at level 2; and the nodes in the line seg- 
ments will themselves be interconnected through 
nodes at level 3. Clearly then, a banyan with at 
least four node levels (0 through 3) is required 
for mapping the cube using this method. Since
every level x must have f(x) ≥ 3, 0 ≤ x ≤ L-1,
and since four levels of nodes are required (L = 3),
mapping the cube using this method requires a
banyan with at least f^L = 3^3 = 27 bases. It is
shown later how this number can be lessened. 
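The arithmetic behind the base count can be sketched as follows. The helper names are hypothetical, and the sketch assumes a uniform fanout at every level, which matches the cube example but is not required by the method.

```python
def levels_required(num_partitionings):
    """Node levels needed: one more than the number of strong partitionings."""
    return num_partitionings + 1


def min_bases(num_partitionings, fanout=3):
    """Bases of a banyan with levels 0..L and uniform fanout f: f ** L.

    L equals the number of strong partitionings, since levels 0..L
    give L + 1 node levels.
    """
    if fanout < 3:
        raise ValueError("the bipartitioning method needs fanout >= 3")
    return fanout ** num_partitionings


# Cube: three partitionings (cube, squares, lines).
print(levels_required(3), min_bases(3))
```

With four partitionings, as in the 4x4 4-nearest neighbors structure discussed later, the same formula gives five node levels and 81 bases.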


Following the strong partitioning of a
structure, processors and base nodes must be
assigned to the decomposed structure. This process
is explained below. Given a base distance
decomposition of an SW-banyan [DeGroot], the 
nodes within a given level x are numbered from 
left to right, from 1 through n(x). The numbers 
of the apex nodes of crossbars between levels L-1 
and L differ by 1. Crossbars between levels L-2 
and L-1 have their apex nodes differing by s(L). 
Crossbars between levels L-3 and L-2 have apexes 
differing by s(L-1)s(L), and so on. In general,
crossbars between levels x and x+1 have apex
numbers differing by


d(x) = 1,                       for x = L-1
d(x) = s(x+2) s(x+3) ... s(L),  for 0 ≤ x ≤ L-2


Figure 5 illustrates this property. This 
SW-banyan will be used to demonstrate the map- 
ping. In the cube partitioning, the line segments 
are formed by connecting together two processors 
through a node at level 3, and thus crossbars 
between levels 2 and 3 are used to form the map- 
pings of the nodes in these line segments. In the 
(3,3) SW-banyan of Figure 5, these crossbars 
have apex nodes differing by one. The line seg- 
ments are interconnected into squares through 
nodes one level up, at level 2, requiring crossbar 
interconnections between levels 1 and 2. The 
interconnected processors must thus have num- 
bers differing by d(1) = s(L) = 3. Similarly, the 
level 1 interconnections of the squares require 
that the interconnected processors differ by 
d(0) = s(L-1)s(L) = 9. 
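The difference d(x) is a product of spreads, so it is easy to compute directly. In this sketch the list `spread` (an assumed representation) holds s(0) through s(L), with s(0) = 0 for the apex level; the function name is hypothetical.

```python
from math import prod


def apex_difference(x, spread):
    """d(x) for crossbars between levels x and x+1.

    spread[k] is s(k) for k = 0..L. The product s(x+2)...s(L) is empty
    for x = L-1, and an empty product is 1, which matches the first case.
    """
    last_level = len(spread) - 1  # this is L
    if not 0 <= x <= last_level - 1:
        raise ValueError("x must satisfy 0 <= x <= L-1")
    return prod(spread[x + 2:last_level + 1])


# The (3,3) SW-banyan of Figure 5 with L = 3: spread 3 below the apexes.
spread_33 = [0, 3, 3, 3]
print([apex_difference(x, spread_33) for x in range(3)])
```

For this banyan the three values are d(0) = 9, d(1) = 3, and d(2) = 1, the differences used in the cube mapping.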


The beauty of this mapping method is that only 
two arbitrary choices are required. For the cube, 
one line segment is selected and the nodes 
(processors) connected by it are numbered so that 
they differ by only one. Clearly many such num- 
berings are possible. Without loss of generality, 
processors 1 and 2 can be selected. Assignments 
to the processors in the other line segment in the
corresponding square are made simply by assigning
them numbers that differ by d(1) = 3 from those of
the nodes to which they are connected. See
Figure 6. Node 1 in the first line segment con- 
nects to node 1 + 3 = 4; node 2 connects to node 
2+3=5. Note that the new line segment 
processor assignments, processors 4 and 5, differ 
from each other by only one as required. Node 
assignments can now be made in the other square. 
To do so, nodes are simply chosen which differ by 
d(0) = 9 from their connecting nodes. The new 
node assignments clearly maintain the required 
node relationships, as shown in Figure 7. Once 
the nodes have all been labeled, the cube can be 
loaded onto the banyan. Figure 8 shows one pos- 
sible loading using the assigned processors. It is 
apparent from the figure that even with the given 
node labeling many loadings are possible simply by 
choosing different bases. In the figure, each base 
selected is only one of three possible. 
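The cube labeling described above can be reproduced with a short sketch. It relies on a property of the cube that does not hold for every structure: at each partitioning, every node lies on the cut, so the second substructure is a copy of the first shifted by d. The function name and argument layout are assumptions.

```python
def label_cube_like(seed=(1, 2), differences=(3, 9)):
    """Label a cube-like structure by repeated translation.

    seed        -- the two processors of the first line segment
    differences -- d(x) values for successive enclosing levels,
                   innermost first (d(1) = 3, then d(0) = 9 for the cube)
    Returns (labels, edges).
    """
    labels = list(seed)
    edges = [(seed[0], seed[1])]
    for d in differences:
        shifted_edges = [(u + d, v + d) for u, v in edges]  # copy of structure
        cut_edges = [(u, u + d) for u in labels]            # links across the cut
        labels += [u + d for u in labels]
        edges += shifted_edges + cut_edges
    return labels, edges


labels, edges = label_cube_like()
print(labels)
print(len(edges))
```

Starting from processors 1 and 2, this yields the eight labels 1, 2, 4, 5, 10, 11, 13, 14 and twelve edges, each joining labels that differ by 1, 3, or 9.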


To see how the interconnections between two 
substructures are achieved, note that processors 
1 and 4 are interconnected at level 2 through node 
7, implying the use of base node 7 as well. Node 
7 differs from 4 by d(1) = 3, just as does 4 from 
1. Similarly, nodes 2 and 5, which differ by 3, 
are interconnected through node 5 + 3 = 8 at level 
2 and at the base. Nodes 1 and 10, which differ
by 9, can be interconnected using node
10 + 9 = 19, and so forth. The fact that the third
node is always available for interconnecting the
other two nodes is due to the constraint that
f(x) ≥ 3, 0 ≤ x ≤ L-1. This ensures that there
will be at least three base nodes in every crossbar
within the banyan. The three nodes will
naturally reside within three different component
banyans. By having assigned node numbers in the
above manner, the availability of the third node 
for interconnecting the other two is ensured. 
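The choice of the third node in the worked examples (1 and 4 through 7, 2 and 5 through 8, 1 and 10 through 19) follows one pattern: the larger label plus the difference. The sketch below applies that pattern to all twelve cube edges. It assumes, as in those examples, that the two processors occupy the two lowest-numbered positions of their crossbar; the function name is hypothetical.

```python
def interconnect_node(a, b):
    """Third node used to join processors a and b (pattern from the examples)."""
    d = abs(a - b)
    return max(a, b) + d


# Cube edges under the labeling 1, 2, 4, 5, 10, 11, 13, 14.
cube_edges = [
    (1, 2), (4, 5), (10, 11), (13, 14),   # differ by d(2) = 1
    (1, 4), (2, 5), (10, 13), (11, 14),   # differ by d(1) = 3
    (1, 10), (2, 11), (4, 13), (5, 14),   # differ by d(0) = 9
]
processors = {1, 2, 4, 5, 10, 11, 13, 14}
third_nodes = [interconnect_node(a, b) for a, b in cube_edges]

print(third_nodes)
print(processors.isdisjoint(third_nodes))
```

Under this assumption every edge receives a distinct third node, none of which coincides with a processor label and none of which exceeds 27.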


A most important point to notice about the 
loaded structure is that given any communication 
path, if the two processors on this path are com- 
municating, there can be no other two communicat- 
ing processors that interfere with the first two. 
This is true whether the communication is accom- 
plished with packet switching or circuit switching. 
An important point to notice about the mapping
method itself is that it takes time linear with the 
number of processors and communication channels 
to be mapped once the partitionings have been 
determined. Determining the partitionings may be 
much more complicated, and in fact, it is unclear 
at present how effectively this could be 
automated. 


4-Nearest Neighbors 


Another common problem structure is the
4-nearest neighbors structure. In this structure,
the nodes are organized in rows and columns, as
elements of a matrix. Except for the outside
elements, node (i,j) is connected to nodes (i+1,j),
(i-1,j), (i,j+1), and (i,j-1). Strong partitioning
can be performed on this structure just as well as 
it was on the cube. Figure 9 shows a set of suit- 
able strong partitionings of the 4x4 4-nearest 
neighbors structure. Because 4 levels of partition- 
ing are required, 5 levels of nodes are required. 
And again, because fanouts of 3 or more are 
required, at least 81 bases are required with this 
method. Node assignments can be made as before, 
with the restriction that the fourth level parti- 
tionings must have nodes that differ by 1, third 
level nodes must have nodes differing by s(L), 
and so on. In Figure 10, the needed differences 
are labeled along the connecting edges. Figure 11 
shows one possible node labeling using the
required node differences. Finally, Figure 12
shows this structure loaded onto the (3,3)
SW-banyan. Again, it is important to notice that 
no two active communication paths can possibly 
interfere with each other. 
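The claim that the 4x4 4-nearest neighbors structure needs four levels of strong partitioning can be checked with a sketch that halves the mesh recursively. The halving rule used here (split across the longer side, columns first on a tie) is one assumed choice and need not match the partitionings of Figure 9.

```python
from collections import Counter


def mesh_edges(rows, cols):
    """Edges of a rows x cols 4-nearest neighbors mesh."""
    edges = []
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:
                edges.append(((i, j), (i + 1, j)))
            if j + 1 < cols:
                edges.append(((i, j), (i, j + 1)))
    return edges


def is_strong_cut(edges, part_a, part_b):
    """True if no node touches more than one edge between the two parts."""
    touched = Counter()
    for u, v in edges:
        if (u in part_a and v in part_b) or (u in part_b and v in part_a):
            touched[u] += 1
            touched[v] += 1
    return all(count <= 1 for count in touched.values())


def halve(nodes):
    """Split a rectangular block of (i, j) nodes across its longer side."""
    rows = sorted({i for i, _ in nodes})
    cols = sorted({j for _, j in nodes})
    if len(cols) >= len(rows):
        mid = cols[len(cols) // 2]
        return ({n for n in nodes if n[1] < mid},
                {n for n in nodes if n[1] >= mid})
    mid = rows[len(rows) // 2]
    return ({n for n in nodes if n[0] < mid},
            {n for n in nodes if n[0] >= mid})


def partition_depth(nodes, edges):
    """Strong bipartitionings needed to reach single nodes."""
    if len(nodes) == 1:
        return 0
    part_a, part_b = halve(nodes)
    if not is_strong_cut(edges, part_a, part_b):
        raise ValueError("halving produced a cut that is not strong")
    return 1 + max(partition_depth(part_a, edges),
                   partition_depth(part_b, edges))


all_nodes = {(i, j) for i in range(4) for j in range(4)}
print(partition_depth(all_nodes, mesh_edges(4, 4)))
```

A depth of four corresponds to five node levels and, with fanouts of 3, to the 81 bases stated in the text.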


EXTENDING THE PARTITIONING METHOD


This section describes several ways in which
the partitioning method can be extended to make it
more powerful.


Higher Level Partitionings 


So far, example applications have recursively 
strongly bipartitioned a structure into two sub- 
structures at each step. These two substructures 
were assigned to two separate component
SW-banyans and a third was used to interconnect
the two. This process required at least three component
SW-banyans at each step, or equivalently,
f(x) ≥ 3 for 0 ≤ x ≤ L-1. When f(x) > 3 for some
node level, it might be possible to strongly parti- 
tion a given structure into more than two sub- 
structures. If a structure is strongly partitioned 
into n substructures and mapped onto separate 
level x+1 component SW-banyans, n-1 other com- 
ponent SW-banyans at the same level can be used 
to form the necessary interconnections between 
the substructures. The total number of required
component SW-banyans at level x+1 is then
2n - 1. Thus the fanouts of the nodes at level x
must be 2n - 1 or more. Note that 2n - 1 is
always odd. Equivalently, if f(x) = n for some
level x, 0 ≤ x ≤ L-1, a given structure can be
strongly partitioned into (n+1)/2 substructures, if
possible. Figure 13 illustrates the process of 
strongly partitioning a structure into more than 
two substructures. As an example of a structure 
that may be strongly partitioned into more than 
two partitions, consider the 6x6 4 nearest neigh- 
bors structure. A set of suitable strong 
bipartitionings and tripartitionings of this struc- 
ture is illustrated in Figure 14. 
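The relation between fanout and the number of substructures can be written as two small helpers. The names are hypothetical, and rounding (n+1)/2 down for even fanouts is an assumption, since the text only notes that 2n - 1 is always odd.

```python
def required_fanout(num_substructures):
    """Component SW-banyans needed: n for the parts, n - 1 to interconnect."""
    return 2 * num_substructures - 1


def max_substructures(fanout):
    """Largest n with 2n - 1 <= fanout, i.e. (fanout + 1) // 2."""
    return (fanout + 1) // 2


for n in (2, 3, 4):
    print(n, required_fanout(n))
print(max_substructures(3), max_substructures(4), max_substructures(5))
```

These are the general requirements of the method as stated; a particular structure may need fewer component SW-banyans.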


In some cases, n substructures do not require
n - 1 component SW-banyans for interconnection, 
thus allowing fanouts of less than 2n - 1. For 
instance, three substructures might easily be 
mapped onto three component SW-banyans and 
interconnected through only a fourth. As an 
example, consider the structure shown in Figure 
15 and the accompanying strong partitioning and 
node labelings. This structure is clearly the 
6-processor pipeline. Both the strong parti- 
tionings and the node labelings obey the con- 
straints described above. Figure 16 shows this 
structure loaded onto the strongly rectangular 
(4,2) SW-banyan. Note that at level 1 the struc- 
ture is strongly tripartitioned but that only four 
component SW-banyans are used to map the three 
substructures and interconnect them, instead of 
2x3 - 1 = 5. The mapping method is not formally 
described here for use with fewer than 2n - 1 
component SW-banyans, but it is apparent that it 
is sometimes possible. 


Sometimes a substructure is simple enough or 
small enough to be mapped by other methods than 
the partitioning method and may consequently 
require component SW-banyans with fewer levels 
than might be required by the partitioning
method. One trivial instance involves substruc- 
tures with n or fewer processors interconnected 
by n or fewer communication paths. Such sub- 
structures can easily be mapped onto any crossbar 
of size nxn or more. For example, the cube can be
strongly bipartitioned into two squares. Each
square contains 4 processors and 4 communication
paths. Clearly these squares (or rings) can be
mapped onto 4x4 crossbars. Since only one strong 
partitioning is required to reduce the cube to 
substructures that can be mapped onto crossbars, 
only three levels of nodes are required. Figure 17 
shows the cube loaded onto such a banyan. 


Interior Bounce-Back Points 


It is possible to design communication networks 
in which all communication turns around in the 
middle of the network where the downward paths 
meet the upward paths instead of traveling all the 
way through the network to a base before turning 
around. In such cases, it may be possible to omit 
the allocation of certain base and intermediate lev- 
el nodes. Since it is the availability of nodes and 
links that determines in large part the success of 
a potential mapping in a busy system, the more 
nodes and links that are available, the better the 
chance of success for the mapping. In addition, a 
larger number of structures will be able to 
coreside on the network at a given time. The 
mapping method presented is presently unable to 
take advantage of idle nodes that have busy nodes 
above them, and thus is presently incapable of 
satisfactorily dealing with such networks except 
as regular networks. 


Packet Switching Enhancements


When packet switching is used as the form of
communication, greater flexibility is allowed in the
mappings. If it is impossible to map a given structure,
say because no suitable partitionings may be
found, then communication channels may be 
dropped from the computation structure graph one 
by one until partitionings are possible. This 
reduced graph may then be mapped onto the net- 
work, and those communication channels which 
were dropped may be arbitrarily assigned. These 
arbitrarily assigned channels may result in inter- 
ference with the regularly mapped communication 
channels, but the number of interfering channels 
will hopefully be minimized by the partitioning 
mapping method. 


Loading Onto Busy Networks 


If the system already has some jobs loaded onto 
it, a given node labeling for a partitioned struc- 
ture may not be loadable onto the network due to 
the indicated nodes, links, or bases being in use. 
In such cases, adding 1 modulo the number of 
apexes to every labeled node produces another 
labeling that still satisfies the needed labeling 
constraints. Each such labeling can be tried until 
all have been tried but failed or until one suc- 
ceeds. In addition, it is possible to try for inter- 
connections of substructures at higher levels. For 
example, when a cube is decomposed into two
squares, these squares can be interconnected
through many different levels. Although doing so 
is easy, how this is done has not been explained 
in this presentation. 
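The retry procedure (add 1 modulo the number of apexes to every label until a labeling fits) can be sketched as below. This is a simplification: it tests only whether the processors themselves are free, whereas the text also requires the corresponding nodes, links, and bases to be available. The function names and the `busy` set are assumptions.

```python
def shift_labels(labels, num_apexes, shift=1):
    """Add `shift` to every 1-based label, wrapping modulo the apex count."""
    return [((label - 1 + shift) % num_apexes) + 1 for label in labels]


def first_free_labeling(labels, num_apexes, busy):
    """Try each shifted labeling in turn; return the first that avoids `busy`.

    Returns None when every one of the num_apexes shifts collides.
    """
    for shift in range(num_apexes):
        candidate = shift_labels(labels, num_apexes, shift)
        if busy.isdisjoint(candidate):
            return candidate
    return None


cube_labels = [1, 2, 4, 5, 10, 11, 13, 14]
# Hypothetical situation: processors 1, 2, and 3 are already in use.
print(first_free_labeling(cube_labels, num_apexes=27, busy={1, 2, 3}))
```

At most as many labelings as there are apexes are tried, so the search stays linear in the size of the network for a fixed structure.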


Mapping Onto Non-rectangular SW-banyans 


As mentioned at the beginning of this paper,
the partitioning mapping method is suitable for
both rectangular and non-rectangular
SW-banyans. So far, all the examples have dealt
solely with rectangular SW-banyans. Figure 18 
however shows a cube mapped onto an 8x27 
non-rectangular SW-banyan. Given the particular 
orientation of the banyan in the figure, it is easy 
to see how certain processors in rectangular 
SW-banyans are unnecessary for the partitioning 
method. For instance, the rightmost component 
SW-banyan at level 1 is used solely to interconnect 
processors which are above the leftmost and mid- 
dle level 1 component SW-banyans. Therefore, the 
processors above this third (rightmost) component 
are not needed, and indeed, in most cases, will 
probably be rendered useless by the mapping. 
The same situation exists recursively within each 
component banyan. As a consequence, it appears 
that fanouts of 3 and spreads of 2 are best suited 
to this mapping scheme. A network with this
topology was serendipitously selected as the TRAC
network early in the project.


Non Strongly-Partitionable Structures 


Recent studies have shown that the partition- 
ing method is extendable to structures that are 
not strongly partitionable. Thus the number of 
structures that may be mappable with the parti- 
tioning method is larger than first believed. Fur- 
ther research is needed in order to determine the 
limitations of the partitioning mapping method for 
non-strongly partitionable structures. 
Figure 18


OTHER PARTITIONABLE STRUCTURES 


As noted near the beginning of this 
paper, the partitioning method as presented is 
applicable to only a very restricted class of com- 
putation structures. These structures must be 
able to be partitioned into a number of disjoint 
substructures such that no two edges in any cut 
set are connected to the same node. Fortunately, 
there are many useful structures in this class. 
Figure 19 illustrates just a few of the many such 
structures. Figure 20 shows several structures 
that do not fall into this class. These structures 
will be mappable with an extended partitioning 
method. 


Figure 19 


Figure 20 


SUMMARY 


A resource allocation scheme has been pre- 
sented for mapping certain classes of job struc- 
tures onto regular SW-banyan networks, both 
rectangular and non-rectangular. This method 
works best when the job structures to be mapped 
are strongly partitionable. For a given partition- 
ing of a structure, an assignment of processors 
and base nodes can be made in linear time. Many
resource assignments are possible for a
given partitioning, which increases the probability
of loading a structure onto a busy system. Several
ways in which the basic partitioning method can 
be extended have been discussed. 
Abstract


This paper describes an approach to connecting
hardware resources for high-performance computation.
Two basic algorithms are designed for
configuring binary tree topologies. The configuring
command can be issued from any processing node.
The algorithms can select proper nodes for connection
while maintaining good utilization of processing
nodes.


I. Introduction 


Recently, owing to VLSI technology, the cost of
hardware has decreased drastically, and researchers
are attempting to add more and more hardware to
computing systems. In order to improve computing
systems, they are using multiple processors instead of
a single processor to achieve higher performance.
Generally these processors are interconnected
by a communication network. However, an improper
communication structure is liable to incur excessive
interprocessor communication, which is referred
to as the saturation effect [1]. This excessive
interprocessor communication would seriously degrade
the overall performance of multiple processor
systems. Therefore, in designing a multiple
processor system, the communication structure is a
major factor to be considered.

In order to minimize interprocessor communication
and improve resource utilization, several
different techniques have been employed to configure
processors into particular topologies, such as
linear array, star, loop, cube, and binary tree
[2-6]. However, a single topology is suitable only
for some specific tasks. Performance will
be improved provided that processors can be dynamically
reconfigured into the topologies needed
in the computation. The main idea behind this
work is to design two basic configuration algorithms
to connect processors into a binary tree topology
through a communication subnet, starnet, which
is able to support some other configuration topologies
[7]. Configuration commands can be issued
by any processor. The remainder of this paper is
organized as follows. Section II presents both the
system overview and the mathematical models. In Section
III the configuration algorithms for binary trees
are presented. Section IV contains the conclusions.


II. System Overview and Models


A. Introduction to Star

Star is a collection of N heterogeneous processing
nodes together with a communication subnet,
starnet, through which the processing nodes are able to
communicate with each other. Generally, the number
of processing nodes (the size of starnet) is a
power of two, $N = 2^n$. The starnet is a cascade of a
baseline network and a bit reversal permutation P.
In Star, processing nodes are numbered from 0 to
N-1. For convenience, each node is labeled with
an n-bit binary number, which is referred to as its
address.

The switching methodology employed in the
starnet is circuit switching, that is, a physical
path is actually established between processing
nodes.

For a specific destination node, all source
nodes issue the same routing tag no matter where
they are, and the routing tag is exactly the
address of the destination. More precisely, let
source address $S = s_{n-1} s_{n-2} \ldots s_0$ and destination


D = dol a, respectively. At stage 


i, the requested switching element can be describ- 
ed by 


address 


(Qi 2 hoed:, os 


n-1 n-i 50° * Snei-2> 4? 


where 0 < i < n-1. And the requested link can be 
described by 


(doe dg Syr+ +S 


Jy» 


n-i-1 
where 0 < 1 <n. 


B. Description of Functions

In this subsection, we examine the problem of
conflict-free mapping on Star. Before the
formal description of conflict-free mapping is
presented, two basic functions $\phi$ and $\psi$ are
introduced, which facilitate the detection of conflicts
among connections of starnet.
DEFINITION. Let $U = U'M + u$ and $V = V'M + v$, where
$U$, $V$, $U'$, $V'$, $M$, $u$, $v$ are n-bit binary numbers
and $u, v < M$, $M = 2^m$, $0 \le m \le n$. A function $\phi$ is
defined as

$\phi(U, V) = \max [m]$,

where m makes U and V have u = v.
DEFINITION. Let U, V be n-bit binary numbers. The
function $\psi$ is defined as

$\psi(U, V) = \phi(P(U), P(V))$.

For example, if U = #119119 and V = $1991 then
$\phi(U, V) = 3$ and $\psi(U, V) = 2$. Two extremes of $\phi$
are (1) m = n, which leads to U = V, and (2) m = 0, in
which case one of them is an even number and the
other is an odd number. Functions $\phi$ and $\psi$ also have
the following properties. Note that the following bit
manipulation is based on modulo-n.


LEMMA 1. If $\phi(U, V) = k$, then for any integer C,
$\phi(U + C, V + C) = k$.

LEMMA 2. Let $U = U'M + u$ and $V = V'M + v$. If
$V > U$, for any integer $M = 2^m$, then $V' \ge U'$.
Moreover, if $V > U$ and $u > v$, then $V' > U'$.


LEMMA 3. If U, V are n-bit binary numbers and 
PUM) a Ne ARM sey: And ae -sae, 


Arid: ae a Cs Vi) eek. hen 
o(U, V) < n-k'!. 


LEMMA 4. If $\phi(U, V) = k < n$, then
$\phi(2U, 2V) = k + 1$.
LEMMA 5. If #(U, V) = k and O<k<n, then 


b(U+1, V) <n-k. 


For convenience, let an ordered pair (S, D)
represent a connection between two nodes S and D,
where S is the source node and D is the destination node.
In starnet, a conflict at a 2x2 switching element
is defined as a situation in which two connections
from different input ports and with different
routing tags compete for the same output port.
Therefore, if a set of connections $(S_1, D_1), (S_2, D_2), \ldots$
does not cause any conflict, then they are mappable
on Star.
THEOREM 1. In Star, a set of connections $(S_1, D_1),
(S_2, D_2), \ldots$ is mappable if and only if for any
two connections $(S_i, D_i)$ and $(S_j, D_j)$ where $S_i \ne S_j$
and $D_i \ne D_j$, the following inequality is satisfied:

$\phi(S_i, S_j) + \psi(D_i, D_j) < n$.
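The condition of Theorem 1 can be checked mechanically. The following is a minimal sketch, assuming that the first function counts the low-order bits on which two n-bit addresses agree (the largest m for which u = v in the definition) and that P is the n-bit reversal. The names `bit_reverse`, `phi`, `psi` and `mappable` are illustrative and do not come from the paper.

```python
def bit_reverse(x: int, n: int) -> int:
    """Reverse the n-bit binary representation of x (assumed form of P)."""
    result = 0
    for _ in range(n):
        result = (result << 1) | (x & 1)
        x >>= 1
    return result


def phi(u: int, v: int, n: int) -> int:
    """Largest m (0 <= m <= n) such that u and v agree in their m low-order bits."""
    m = 0
    while m < n and ((u >> m) & 1) == ((v >> m) & 1):
        m += 1
    return m


def psi(u: int, v: int, n: int) -> int:
    """phi applied to the bit-reversed addresses, i.e. agreement on high-order bits."""
    return phi(bit_reverse(u, n), bit_reverse(v, n), n)


def mappable(connections, n: int) -> bool:
    """Test the Theorem 1 inequality for every pair with distinct sources and destinations."""
    for a in range(len(connections)):
        for b in range(a + 1, len(connections)):
            (s1, d1), (s2, d2) = connections[a], connections[b]
            if s1 != s2 and d1 != d2:
                if phi(s1, s2, n) + psi(d1, d2, n) >= n:
                    return False
    return True
```

With n = 3, the six connections of a seven-node tree rooted at node 0, `[(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)]`, pass the check, while the pair `[(0, 0), (4, 1)]` fails it because the sources share two low-order bits and the destinations share two high-order bits.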


Binary Tree 


The following algorithm is used to configure
a number of processors into an n-level full binary
tree which can be rooted at any arbitrary node
R. The left son of a predecessor $S_j$ is denoted
as $D_{j,1}$ and the right son is denoted as $D_{j,2}$.


ALGORITHM 1 (A1).
Procedure top-down tree (R, n);

begin
  for i := 1 to n-1 do
  begin
    k := 2^(i-1);
    for j := k to 2k - 1 do
    begin
      S_j   := j + R - 1;
      D_j,1 := 2j + R - 1;
      D_j,2 := 2j + R;
    end
  end;
end; (* end of A1 *)
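The top-down procedure can be sketched in Python. This is an illustration under stated assumptions, not the authors' code: node j of the tree is taken to have address j + R - 1 with sons 2j + R - 1 and 2j + R (the forms used in the Appendix, where W = 2U' + R - 1 with U' = U - R + 1), and addresses are assumed to wrap modulo N = 2^n. The function name `top_down_tree` is hypothetical.

```python
def top_down_tree(root: int, n: int):
    """Return the (parent, son) connections of an n-level full binary tree.

    Assumption: addresses wrap modulo N = 2**n, and tree node j sits at
    address j + root - 1, with sons at 2j + root - 1 and 2j + root.
    """
    size = 1 << n
    connections = []
    for i in range(1, n):              # levels that contain parents
        k = 1 << (i - 1)               # first node index on level i
        for j in range(k, 2 * k):      # all nodes on level i
            parent = (j + root - 1) % size
            left = (2 * j + root - 1) % size
            right = (2 * j + root) % size
            connections.append((parent, left))
            connections.append((parent, right))
    return connections
```

For `root = 1` and `n = 3` this yields the familiar heap numbering, with node 1 connected to 2 and 3, node 2 to 4 and 5, and node 3 to 6 and 7.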


An example of a binary tree generated by A1
is shown in Figure 1(a). The following theorem
demonstrates the mappability of A1 on Star.
THEOREM 2. Any full binary tree which is generated
by A1 is mappable on Star.

(For a proof, see Appendix).

The mapping of the previous example is shown in
Figure 1(b). An alternative algorithm which generates
an n-level binary tree in a bottom-up
fashion is described as follows.

ALGORITHM 2 (A2). 


Procedure bottom-up tree (R,n); 


begin 
for i: = 1 to n-1 do 
begin 
Piz gi-l. 
for j : =k to 2k - 1 do 
begin 
oes aye A Re, 




-~ -27 +R-1: 
end;
end; (* end of A2 *)


An example of a binary tree which is generated
by A2 is shown in Figure 2(a).
THEOREM 3. Any binary tree which is generated by
A2 is mappable on Star.
(Proof is similar to Theorem 1). The mapping of
the previous example is illustrated in Figure 2(b).


IV. Conclusions 


The main idea behind this work is to design
two configuration algorithms for n-level binary
trees. First of all, the necessary and sufficient
condition of mappability is found. Then, by means
of the detection functions, the mappability of A1
and A2 is demonstrated. The result is flexible
enough to allow any node to be the root without
centralized control. Hence, A1 and A2 provide a
solution to configuring processors into a tree
topology in a distributed fashion.
Appendix


Proof of Theorem 1:

Let (U, W) and (V, Z) be two connections of a
binary tree T generated by A1. In the following
discussion, only the condition that $U \ne V$ is
considered.

(i) W, Z are the left sons of U, V.
Let
$U' = U - R + 1$,
$V' = V - R + 1$,
and
$\phi(U', V') = k$.
From A1,
$W = 2U' + R - 1$,
$Z = 2V' + R - 1$.
From LEMMA 1,
$\phi(U', V') = \phi(U' + R - 1, V' + R - 1) = \phi(U, V) = k$.
From LEMMA 4,
$\phi(2U', 2V') = k + 1$.
Again, from LEMMA 1,
$\phi(2U', 2V') = \phi(2U' + R - 1, 2V' + R - 1) = \phi(W, Z) = k + 1$.
From LEMMA 3,
$\psi(W, Z) < n - k - 1$.
Combine the above two inequalities,
$\phi(U, V) + \psi(W, Z) < n - 1 < n$.

(ii) W, Z are the right sons of U, V.
In a similar way, it can be derived that
$\phi(U, V) + \psi(W, Z) < n$.

(iii) W is the left son of U and Z is the right
son of V.
Let
$U' = U - R + 1$,
$V' = V - R + 1$,
and
$\phi(U', V') = k$.
From A1,
$W = 2U' + R - 1$,
$Z = 2V' + R$.
From LEMMA 1,
$\phi(U', V') = \phi(U' + R - 1, V' + R - 1) = \phi(U, V) = k$.
From LEMMA 4,
$\phi(2U', 2V') = \phi(2U' + R - 1, 2V' + R - 1) = k + 1$.
From LEMMA 5,
$\psi(2U' + R - 1, 2V' + R) < n - k$.
Hence,
$\psi(W, Z) < n - k$.
Combine the above inequalities,
$\phi(U, V) + \psi(W, Z) < n$.

As a result, any two connections of T are
conflict-free in starnet. Therefore, T is
mappable on Star.

Q.E.D.

Figure 1. An example of top-down tree 
Figure 2. An example of bottom-up tree


Performing the Shuffle with the PM2I 
and Illiac SIMD Interconnection Networks 


Robert R. Seban 
Howard Jay Siegel 


Purdue University 
West Lafayette, Indiana 47907


Abstract--Three SIMD single stage interconnection
networks which have been proposed and studied in the
literature are the Illiac, PM2I, and Shuffle-Exchange.
Here the ability of the Illiac and PM2I networks to perform
the shuffle interconnection in an SIMD machine
with N processors is examined. A lower bound of
$3\sqrt{N}/2$ transfers for the Illiac to shuffle data is derived.
An algorithm to do this task in $2\sqrt{N} - 1$ transfers is
given. A lower bound of $\log_2 N$ transfers for the PM2I to
shuffle data has been published previously. An algorithm
to do this task in $\log_2 N + 1$ transfers is
presented here.


1. Introduction 


This paper extends SIMD interconnection network 
studies presented in [28, 31]. In particular, the ability of 
the PM2I and Illiac single stage interconnection SIMD 
machine networks to perform the shuffle interconnection 
is examined. In [28] it is shown that a lower bound on
the number of transfers needed for the PM2I network to
perform the shuffle is $\log_2 N$, where N is the number of
processing elements in the SIMD machine. The algorithm
presented here requires only $(\log_2 N) + 1$ transfers.
This algorithm is used as the basis for an algorithm to do
the shuffle with the Illiac network in $(2\sqrt{N}) - 1$ transfers.
This compares favorably with an earlier result of $4(\sqrt{N} - 1)$ in
[25]. In addition, a lower bound of $3\sqrt{N}/2$ on the number of
transfers required for the Illiac to do the shuffle is proved.

The model of SIMD machines used is described in 
Section 2. In Section 3 the interconnection networks are 
formally defined. An algorithm to shuffle data using the 
PM2I network is given in Section 4. The lower bound 
analysis and algorithm for performing the shuffle with 
the Illiac network is presented in Section 5. 


2. SIMD Machine Model 


Typically, an SIMD (single instruction stream - multiple
data stream) machine is a computer system consisting
of a control unit, N processors, N memory
modules, and an interconnection network. The control
unit broadcasts instructions to the processors, and all
active processors execute the same instruction at the
same time. Each active processor executes the instruction
on data in its own memory module. The interconnection
network, sometimes referred to as an alignment
or permutation network, provides for communications
among the processors and memory modules. Examples
of SIMD machines that have been constructed are the
Illiac IV [6] and STARAN [2, 3].

One way to view the physical structure of an SIMD
machine is as a set of N processing elements interconnected
by a network, where each processing element (PE)
consists of a processor with its own memory. This type
of configuration is shown in Fig. 1. It is called the
PE-to-PE organization. The network is unidirectional and
connects each PE to some subset of the other PEs. A
transfer instruction causes data to be moved from each
PE to one of the PEs to which the PE is connected by
the network. (Here only one-to-one communications will
be considered, i.e., broadcasting (one-to-many) connections
are not considered.) To move data between two
processing elements that are not directly connected, the
data must be passed through intermediary processing
elements by executing a programmed sequence of data
transfers. An alternative to the PE-to-PE SIMD
machine organization is to position a bidirectional network
between the processors and the memories. The
PE-to-PE paradigm will be used here; however, the
results presented will be applicable to the other organization
also.

This material is based upon work supported by the National Science
Foundation under Grant ECS-8120896.

Fig. 1: PE-to-PE SIMD machine configuration, with N PEs.

The formal model of an SIMD machine used here 
consists of five parts: processing elements, control unit 
instructions, processing element instructions, masking 
schemes, and interconnection functions. It is a 
mathematical model that provides a common basis for 
evaluating and comparing the various components of 
different SIMD machines. This model is based on the 
one presented in [31]. 

Each processing element (PE) is a processor together
with its own memory. There are N PEs, addressed (numbered)
from 0 to N-1, where $N = 2^m$. It is assumed that
the processor contains a fast access general purpose
register A and a data transfer register (DTR). When
data transfers among PEs occur, it is the DTR contents
of each PE that are transferred. At any point in time,
each PE is either in the active or the inactive mode. If a
PE is active, it executes the instructions broadcast to it
by the control unit. If a PE is inactive, it will not execute
the instructions broadcast to it.

The control unit stores the SIMD programs, executes
control of flow instructions, and broadcasts processing
element instructions to the PEs. An example of
a control of flow instruction is the loop statement
"for i=0 until N-1 do..."

The processing element instructions consist of those 
operations that each processor can perform on data in its 
individual memory or registers. It is assumed the set of 
processing element instructions includes the capability to 
move data among the registers. The notation "Z ← Y"
means the contents of register Y are copied into register
Z. The notation "Z ↔ Y" means two registers
exchange their contents.

A masking scheme is a method for determining
which PEs will be active at a given point in time. The
PE address masking scheme uses an m-position mask to
specify which PEs are to be activated, each position of
the mask corresponding to a bit position in the binary
addresses of the PEs [28]. Each position of the mask will
contain either a 0, 1, or X ("don't care"). The only PEs
that will be active are those that match the mask for all
i, $0 \le i < m$: if the mask has a 0 in the i-th position,
then the PE address must have a 0 in the i-th position; if
the mask has a 1 in the i-th position, then the PE
address must have a 1 in the i-th position; and if the
mask has an X in the i-th position, then the PE address
may have either a 0 or 1 in the i-th position. For example,
if N = 8 and the mask is 1X0, then only PEs 6 and
4 are active. Superscripts are used as repetition factors,
e.g., $X^3 0 1^2$ is XXX011. Square brackets will be used to
denote a mask. Each PE instruction and interconnection
function (defined below) will be accompanied by a mask
specifying which PEs will execute that command. For
example, executing "A ← DTR [$X^{m-1}0$]" means that
each even numbered PE is active and loads its A register
from its DTR. Each odd numbered PE is inactive and
does nothing. Further information about the use of PE
address masks is in [18, 28, 31, 34].
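The matching rule for PE address masks can be illustrated with a short sketch. It assumes the mask is written as a string over 0, 1 and X with the high-order position first, as in the example 1X0; the function name `active_pes` is illustrative only.

```python
def active_pes(mask: str, n_pes: int):
    """Return the addresses of the PEs that match an m-position address mask.

    The mask is a string over '0', '1', 'X', high-order position first.
    """
    m = len(mask)
    if n_pes != 1 << m:
        raise ValueError("mask length must equal log2 of the number of PEs")
    active = []
    for pe in range(n_pes):
        bits = format(pe, f"0{m}b")    # m-bit binary address of this PE
        # A position matches if the mask holds X or the same bit as the address.
        if all(c in ("X", b) for c, b in zip(mask, bits)):
            active.append(pe)
    return active
```

For N = 8, `active_pes("1X0", 8)` selects PEs 4 and 6, and `active_pes("XX0", 8)` selects the even numbered PEs.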

An interconnection network can be described by a
set of interconnection functions, where each interconnection
function is a bijection (permutation) on the set of
PE addresses [28]. When an interconnection function f is
applied, PE i sends the contents of its DTR to the DTR
of PE f(i). This occurs for all i simultaneously, for
$0 \le i < N$ and PE i active. Saying that an interconnec-
tion function is a bijection means that every PE sends 
data to exactly one PE, and every PE receives data from 
exactly one PE (assuming all PEs are active). In this 
model, it is assumed that an inactive PE can receive 
data from another PE if an interconnection function is 
executed, but an inactive PE cannot send data. To pass 
data from one PE to another PE a programmed sequence 
of one or more interconnection functions must be exe- 
cuted. This sequence of functions moves the data from 
one PE’s DTR to the other’s by a single transfer or by 
passing the data through intermediary PEs. 

In summary, an SIMD machine can be formally
represented as the five-tuple (N, C, I, M, F), where:

(1) N is a positive integer, representing the number of
PEs in the machine;

(2) C is the set of control unit instructions, i.e.,
instructions that are executed by the control unit
in order to control the flow of the program;

(3) I is the set of processing element instructions, i.e.,
instructions that can be executed by each active
PE and act on data within that PE;

(4) M is the set of masking schemes, where each mask
partitions the set {0, 1, ..., N-1} into two disjoint
sets, the enabled PEs and the disabled PEs; and

(5) F is the set of interconnection functions (i.e., the
interconnection network), where each function is a
bijection on the set {0, 1, ..., N-1}, which determines
the communication links among the PEs.

A particular SIMD machine architecture can be
described by specifying N, C, I, M, and F. In this paper,
$N = 2^m$; C includes "for ... until ... do" instructions for
controlling the flow of loops in the program; I includes
instructions for moving data among the registers of a
given PE; M includes PE address masks; and F is varied.
The assumptions made about the SIMD machine to be
used as the model are intentionally minimal so that the
material presented is applicable to a wide range of
machines.


3. The Interconnection Networks


A. Introduction

In this paper, three networks which can be constructed
from a single stage of switches are examined.
In a single stage network, data items may have to be
passed through the switches several times before reaching
their final destinations. Conceptually, a single stage
network can be viewed as N input selectors and N output
selectors, as shown in Fig. 2 [30]. The way in which
the input selectors are connected to the output selectors
determines the allowable interconnections.

Fig. 2: Conceptual view of a single-stage network.
"IS" is input selector, "OS" is output selector.

The following notation will be used: let $N = 2^m$,
let the binary representation of an arbitrary PE address
P be $p_{m-1}p_{m-2} \ldots p_1p_0$, let $\bar{p}_i$ be the complement of $p_i$,
and let the integer n be the square root of N. It is
assumed that $-j \bmod N = N - j \bmod N$, for $j > 0$.


B. The Illiac Network

The Illiac network consists of four interconnection
functions defined as follows:

$\mathrm{Illiac}_{+1}(P) = P + 1 \bmod N$
$\mathrm{Illiac}_{-1}(P) = P - 1 \bmod N$
$\mathrm{Illiac}_{+n}(P) = P + n \bmod N$
$\mathrm{Illiac}_{-n}(P) = P - n \bmod N$

where n is assumed to be an integer. For example, if
N = 16, $\mathrm{Illiac}_{+n}(0) = 4$. This network allows PE P to
send data to any one of PE P+1, PE P-1, PE P+n, or
PE P-n, arithmetic mod N. This is often referred to as
a four nearest neighbor connection pattern, as shown for
N = 16 in Fig. 3. This network was implemented in the
Illiac IV SIMD machine, where N = 64 [1, 6].

Fig. 3: Illiac network for N = 16. (The actual Illiac
IV SIMD machine had N = 64.) Vertical
lines are $+\sqrt{N}$ and $-\sqrt{N}$. Horizontal lines
are +1 and -1.

Relating this to the conceptual model of a single
stage network shown in Fig. 2, for each i, $0 \le i < N$,
input selector i has lines to output selectors i+1, i-1,
i+n, and i-n, mod N. For each j, $0 \le j < N$, output
selector j gets its inputs from input selectors j-1, j+1,
j-n, and j+n, mod N. Since there is a single instruction
stream in an SIMD machine, all active PEs must use the
same interconnection function (connection) at the same
time. For example, if PE 0 is sending data to PE 1,
then all active PEs must send data using the $\mathrm{Illiac}_{+1}$
connection.

This type of network is included in the MPP [4, 5]
and DAP [16] SIMD systems. Various properties and
capabilities of the Illiac network are discussed in [6, 13,
25, 28, 31, 32].
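As a small illustration, the four Illiac interconnection functions can be written directly from their definitions. This is a sketch only; the function name `illiac_functions` and the string keys are illustrative choices, and N is assumed to be a perfect square so that n is an integer.

```python
import math


def illiac_functions(n_pes: int):
    """Return the four Illiac interconnection functions for N = n_pes PEs."""
    n = math.isqrt(n_pes)
    if n * n != n_pes:
        raise ValueError("N must be a perfect square so that n is an integer")
    return {
        "+1": lambda p: (p + 1) % n_pes,
        "-1": lambda p: (p - 1) % n_pes,
        "+n": lambda p: (p + n) % n_pes,
        "-n": lambda p: (p - n) % n_pes,
    }
```

For N = 16 the "+n" function maps PE 0 to PE 4, matching the example in the text, and each of the four functions is a permutation of the PE addresses.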


C. The Plus-Minus $2^i$ (PM2I) Network

The Plus-Minus $2^i$ (PM2I) network consists of 2m
interconnection functions defined by:

$\mathrm{PM2}_{+i}(P) = P + 2^i \bmod N$
$\mathrm{PM2}_{-i}(P) = P - 2^i \bmod N$

for $0 \le i < m$. For example, $\mathrm{PM2}_{+1}(2) = 4$ if N > 4.
Since $P + 2^{m-1} = P - 2^{m-1} \bmod N$, for all P, $0 \le P < N$,
the interconnection functions $\mathrm{PM2}_{+(m-1)}$ and $\mathrm{PM2}_{-(m-1)}$
are equivalent. Fig. 4 shows the $\mathrm{PM2}_{+i}$ interconnections
for N = 8. Diagrammatically, $\mathrm{PM2}_{-i}$ is the same as
$\mathrm{PM2}_{+i}$ except the direction is reversed. This network is
called the Plus-Minus $2^i$ since, in terms of mapping
source addresses to destinations, it can add or subtract
$2^i$ from the PE addresses, i.e., it allows PE P to send
data to any one of PE $P + 2^i$ or PE $P - 2^i$, arithmetic mod
N, $0 \le i < m$.

In terms of the conceptual model of a single stage
network (Fig. 2), for the PM2I network, for each j,
$0 \le j < N$, input selector j is connected to output selectors
$j + 2^i$ and $j - 2^i$ mod N, for all i, $0 \le i < m$. For
each j, $0 \le j < N$, output selector j gets its inputs from
input selectors $j - 2^i$ and $j + 2^i$ mod N, for all i,
$0 \le i < m$. As with the Illiac network, all active PEs
must use the same PM2I interconnection function at the
same time.

Fig. 4: PM2I network for N = 8. (a) $\mathrm{PM2}_{+0}$ connections.
(b) $\mathrm{PM2}_{+1}$ connections. (c) $\mathrm{PM2}_{+2}$
connections. For the $\mathrm{PM2}_{-i}$ connections,
$0 \le i \le 2$, reverse the direction of the arrows.

A network similar to the PM2I is used in the
"Novel Multiprocessor Array" [24] and is included in the
network of the Omen computer [15]. The concept
underlying the SIMDA machine's interconnection network
is similar to that of the PM2I [36]. The PM2I connection
pattern forms the basis for the data manipulator
[10], ADM [33], and gamma multistage networks.
Various properties of the PM2I are discussed in [11, 27,
28, 29, 31, 32].
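The PM2I functions translate directly into modular arithmetic. The sketch below uses a single hypothetical helper, `pm2`, whose `sign` argument selects the plus or minus function; it is an illustration of the definitions rather than part of the paper.

```python
def pm2(p: int, i: int, n_pes: int, sign: int = 1) -> int:
    """PM2_{+i}(p) when sign is +1, PM2_{-i}(p) when sign is -1, arithmetic mod N."""
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")
    # Python's % always returns a non-negative result, which matches the
    # convention that -j mod N equals N - j mod N.
    return (p + sign * (1 << i)) % n_pes
```

With N = 8, `pm2(2, 1, 8)` gives 4, and for i = m - 1 = 2 the plus and minus functions coincide, for example `pm2(5, 2, 8, 1)` and `pm2(5, 2, 8, -1)` both give 1.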


D. The Shuffle-Exchange Network 

The Shuffle-Exchange network consists of a shuffle
function and an exchange function. The shuffle is
defined by:

$\mathrm{shuffle}(p_{m-1}p_{m-2} \ldots p_1p_0) = p_{m-2}p_{m-3} \ldots p_1p_0p_{m-1}$

and the exchange is defined by:

$\mathrm{exchange}(p_{m-1}p_{m-2} \ldots p_1p_0) = p_{m-1}p_{m-2} \ldots p_1\bar{p}_0$.

For example, shuffle(3) = 6 and exchange(6) = 7, for
$N \ge 8$. This network is shown in Fig. 5 for N = 8.

Consider the conceptual model of single stage networks
shown in Fig. 2. For the Shuffle-Exchange single
stage network, input selector $P = p_{m-1} \ldots p_1p_0$ is connected
to output selectors $p_{m-2} \ldots p_1p_0p_{m-1}$ (= shuffle(P))
and $p_{m-1} \ldots p_1\bar{p}_0$ (= exchange(P)). Output selector
$r_{m-1} \ldots r_1r_0$ gets its inputs from input selectors $r_0r_{m-1} \ldots r_2r_1$
and $r_{m-1} \ldots r_1\bar{r}_0$. As with the other networks, all active
PEs must use the same interconnection function at the
same time.

Fig. 5: Shuffle-Exchange network for N = 8. Solid
line is exchange, dashed line is shuffle.

Mathematical properties of the shuffle are discussed
in [14, 17]. The multistage omega network is a series of
m Shuffle-Exchanges [21]. The shuffle is also included in
the networks of the Omen [15] and RAP [9] systems.
Features of the Shuffle-Exchange are discussed in [7, 8,
11, 13, 19, 20, 22, 23, 27, 28, 31, 32, 35, 37]. (The ability
of each of the PM2I and Illiac networks to perform the
exchange function in just two transfers was presented in
[31] and is not considered here.)
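The shuffle is a one-position left rotation of the m-bit address and the exchange complements the low-order bit. A minimal sketch, with illustrative function names:

```python
def shuffle(p: int, m: int) -> int:
    """Rotate the m-bit address p left by one position."""
    n_pes = 1 << m
    return ((p << 1) | (p >> (m - 1))) & (n_pes - 1)


def exchange(p: int) -> int:
    """Complement the low-order address bit."""
    return p ^ 1
```

For m = 3 this reproduces the examples in the text: `shuffle(3, 3)` is 6 and `exchange(6)` is 7. It also shows the two cases used later, since `shuffle(p, m)` equals 2p for p < N/2 and 2p + 1 mod N otherwise.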


4. Shuffling with the PM2I Network 


The following ground rules will be used in the
design and analysis of the algorithm to perform the
shuffle with the PM2I network.

(1) The model and definitions presented in Sections 2
and 3 will be the formal basis for the results.

(2) When simulating the shuffle, the data that is originally
in the DTR of PE P must be transferred to the
DTR of shuffle(P), for all P, $0 \le P < N$.

(3) The time for each algorithm is in terms of the
number of executions of interconnection functions
required to perform the simulation.

The reason for (3) can be seen by considering the
way in which various instructions can be implemented.
The instructions in the simulation algorithms can be divided
into three categories: control unit operations (in
C), register to register operations (in I), and interprocessor
data transfers (in F). Control unit operations, such
as incrementing a count register in the control unit for a
"for loop," can, in general, be done in parallel (overlapped)
with the previously broadcast PE instruction,
thus taking no additional time. Register to register
operations within a PE will probably involve a single
chip or, at worst, adjacent chips. The inter-PE data
transfers will involve setting the controls of the interconnection
network and passing data among the PEs, involving
board to board, and probably rack to rack, distances.
Thus, unless the number of register to register
operations is much greater than the number of inter-PE
data transfers, the time for the interprocessor transfers
will be the dominating factor in determining the execution
time of the simulation algorithm.

In the algorithm below ":" indicates a comment.
When discussing the algorithms, "Li" is used as an abbreviation
for "line i of the algorithm."

To understand the concept underlying the algorithm
to perform the shuffle, consider the "distance" the shuffle
moves a data item. The data item in the DTR of PE P,
$0 \le P < N/2$, is moved to shuffle(P) = 2P, a distance of
shuffle(P) - P = P. The data item in the DTR of PE
P', $N/2 \le P' < N$, is moved to shuffle(P') = 2P' + 1
mod N, a distance of shuffle(P') - P' = P' + 1. This is
shown in Fig. 6 for N = 8. The algorithm uses the
$\mathrm{PM2}_{+0}$, $\mathrm{PM2}_{+1}$, $\mathrm{PM2}_{+2}$, ..., $\mathrm{PM2}_{+(m-1)}$ and $\mathrm{PM2}_{+0}$
functions, in that order, to move the DTR data from PE
P a distance of P, $0 \le P < N/2$, and the DTR data
from PE P' a distance of P' + 1 mod N, $N/2 \le P' < N$.
This is also shown for N = 8 in Fig. 6. Note that in L3
to L5, for $1 \le j \le m-1$, all of the data of interest is in
even numbered PEs.


Algorithm to perform the shuffle with the PM2I:

(L1) A ← DTR [$X^{m-1}0$]                :even PEs save DTR in A
(L2) $\mathrm{PM2}_{+0}$ [$X^{m-1}1$]              :odd PEs send to even PEs
(L3) for j = 1 until m-1 do
(L4)    A ↔ DTR [$X^{m-j-1}1X^{j-1}0$]   :do if $p_j = 1$, $p_0 = 0$
(L5)    $\mathrm{PM2}_{+j}$ [$X^{m-1}0$]           :even PEs send
(L6) $\mathrm{PM2}_{+0}$ [$X^{m-1}0$]              :even PEs send to odd PEs
(L7) DTR ← A [$X^{m-1}0$]                :reload DTR in even PEs


This algorithm uses m+1 inter-PE data transfers
and m+1 register to register moves. The operation of
this algorithm for N = 8 is shown in Tab. 1. For example,
consider the data item initially in the DTR of PE 5
(= 101). PE 5 does not match the mask in L1 ([XX0]).
PE 5 does match the mask in L2 ([XX1]) and the data is
moved to PE $\mathrm{PM2}_{+0}(5) = 6$ (= 110). PE 6 does match
the mask in L4 when j = 1 ([X10]) and the data is
moved to the A register of PE 6. The data is unaffected
by L5 when j = 1 (since it is not in the DTR). PE 6
does match the mask in L4 when j = 2 ([1X0]) and the
data is moved to the DTR of PE 6. PE 6 does match
the mask in L5 when j = 2 ([XX0]) and the data is
moved to the DTR of PE $\mathrm{PM2}_{+2}(6) = 2$. PE 2 does
match the mask in L6 ([XX0]) and the data is moved to
the DTR of PE $\mathrm{PM2}_{+0}(2) = 3$. PE 3 does not match
the mask in L7 ([XX0]). Thus, the data from PE 5 is
moved to PE 3 = shuffle(5). This is shown by the dotted
line in Tab. 1.
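The walk-through above can be reproduced with a small simulation of lines L1 to L7. This is a sketch under the model's stated assumptions: all active PEs send simultaneously, an inactive PE may receive but does not send, and a DTR that receives nothing keeps its old contents. The helper names (`shuffle`, `transfer`, `pm2i_shuffle`) are illustrative and not part of the paper.

```python
def shuffle(p: int, m: int) -> int:
    """Rotate the m-bit address p left by one position."""
    return ((p << 1) | (p >> (m - 1))) & ((1 << m) - 1)


def transfer(dtr, offset, is_active):
    """One PM2 transfer: every active PE p sends its DTR to PE (p + offset) mod N."""
    n_pes = len(dtr)
    new_dtr = list(dtr)                      # all sends read the old DTR values
    for p in range(n_pes):
        if is_active(p):
            new_dtr[(p + offset) % n_pes] = dtr[p]
    return new_dtr


def pm2i_shuffle(m: int):
    """Simulate L1-L7 with DTR of PE P initially holding P; return (DTR, transfers)."""
    n_pes = 1 << m
    dtr = list(range(n_pes))
    a = [None] * n_pes
    transfers = 0

    def even(p):
        return p % 2 == 0

    def odd(p):
        return p % 2 == 1

    for p in range(n_pes):                   # L1: even PEs save DTR in A
        if even(p):
            a[p] = dtr[p]
    dtr = transfer(dtr, 1, odd)              # L2: PM2_{+0}, odd PEs send
    transfers += 1
    for j in range(1, m):                    # L3
        for p in range(n_pes):               # L4: exchange A and DTR if p_j = 1, p_0 = 0
            if even(p) and (p >> j) & 1:
                a[p], dtr[p] = dtr[p], a[p]
        dtr = transfer(dtr, 1 << j, even)    # L5: PM2_{+j}, even PEs send
        transfers += 1
    dtr = transfer(dtr, 1, even)             # L6: PM2_{+0}, even PEs send to odd PEs
    transfers += 1
    for p in range(n_pes):                   # L7: reload DTR in even PEs
        if even(p):
            dtr[p] = a[p]
    return dtr, transfers
```

For m = 3 the final DTR contents are `[0, 4, 1, 5, 2, 6, 3, 7]`, so the item that started in PE 5 ends in PE 3 = shuffle(5), and the transfer count is m + 1 = 4.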

To prove the algorithm is correct, induction will be
used (assume all arithmetic is mod N). The induction
hypothesis (proven correct below) is that after executing
$\mathrm{PM2}_{+j}$ in L2 (for j = 0) or L5 (for $1 \le j < m$) the data
originally in the DTR of PE $G = g_{m-1} \ldots g_1g_0$ will
currently be in PE $P = p_{m-1} \ldots p_1p_0 =
(g_{m-1} \ldots g_{j+2}g_{j+1}) * 2^{j+1} + (g_j \ldots g_1g_0) * 2$. (When j = 0,
$P = (g_{m-1} \ldots g_2g_1) * 2 + (g_0) * 2$.) The data will be in the A
register if $g_j = 0$ and in the DTR if $g_j = 1$.

Thus, when j = m-1, the data originally from PE G
is in PE $(g_{m-1} \ldots g_1g_0) * 2$. The data item from the DTR of
PE $(g_{m-1} \ldots g_1g_0) * 2$ is moved to PE $(g_{m-1} \ldots g_1g_0) * 2 + 1$ by
L6; this is correct since this data item is from a PE
where $g_j = g_{m-1} = 1$, so shuffle(G) = 2*G + 1. The
data item from the A register of PE $(g_{m-1} \ldots g_1g_0) * 2$ is
moved to the DTR of that PE by L7; this is correct
since this data item is from a PE where $g_j = g_{m-1} = 0$,
so shuffle(G) = 2*G.

To complete the correctness proof it must be shown 
that the induction hypothesis is true. 

Basis: j = 0. 


Fig. 6: The idea underlying the algorithm for the
PM2I to perform the shuffle, shown for N = 8.


Tab. I: 
N'=.8. 
0<P <8. 


Initial ae L2 j= ey a |= 
DTR DTR DTR DTR 


000 000 

001 001 - 

010 010 010 001 001 010 

O11 011 - - - - 

100 100 100 O11 100 ; Oll 

101 | 101---]...- : : : 

110 110 110°)": 101--+}- 101 110 

111 lll - - - - 
Case 1: The data item from the DTR of PE
$G = g_{m-1} \ldots g_2g_10$. This data item is moved to the
A register of that PE by L1. Since $g_0 = 0$,
$G = (g_{m-1} \ldots g_2g_1) * 2 + (g_0) * 2 = P$. This data is
not moved by L2. It remains in the A register
and $g_0 = 0$. Thus, the induction hypothesis is
true for j = 0 for this case.

Case 2: The data item from the DTR of PE
$G = g_{m-1} \ldots g_2g_11$. This data item is not moved by
L1. It is moved to the DTR of PE P = G + 1
by $\mathrm{PM2}_{+0}$ in L2. Since $g_0 = 1$, $G + 1 =
g_{m-1} \ldots g_2g_11 + 1 = (g_{m-1} \ldots g_2g_1) * 2 + 2 =
(g_{m-1} \ldots g_2g_1) * 2 + (g_0) * 2 = P$. The data item is in
the DTR and $g_0 = 1$. Thus, the induction hypothesis
is true for j = 0 for this case.


Induction Step: Assume true for j = k - 1 and show true 
for j = k. 

Case 1: The data item from the DTR of PE 
$G = g_{m-1} \ldots g_2 g_1 g_0$, where $g_{k-1} = 0$. From the 
induction hypothesis when j = k-1, this data item 
is in the A register of PE 
$P = (g_{m-1} \ldots g_{k+1} g_k) * 2^k + (g_{k-1} \ldots g_1 g_0) * 2$. 

Subcase 1a: $p_k = 1$. The A register data is moved to 
the DTR of PE P by L4 and then to the DTR of 
PE $P + 2^k$ by L5. Recall 
$P = p_{m-1} \ldots p_1 p_0 = (g_{m-1} \ldots g_{k+1} g_k) * 2^k + (g_{k-1} \ldots g_1 g_0) * 2$. 
Since $g_{k-1} = 0$, $(0 g_{k-2} \ldots g_1 g_0) * 2 < 2^k$. Thus, if $p_k = 1$, 
it must be that $g_k = 1$. Since $g_k = 1$, 

$P + 2^k = (g_{m-1} \ldots g_{k+1} 1) * 2^k + (g_{k-1} \ldots g_1 g_0) * 2 + 2^k$ 
$= (g_{m-1} \ldots g_{k+1} 0) * 2^k + 2^k + (g_{k-1} \ldots g_1 g_0) * 2 + 2^k$ 
$= (g_{m-1} \ldots g_{k+1}) * 2^{k+1} + (1 g_{k-1} \ldots g_1 g_0) * 2$ 
$= (g_{m-1} \ldots g_{k+1}) * 2^{k+1} + (g_k \ldots g_1 g_0) * 2$. 

Furthermore, the data is in the DTR and $g_k = 1$. 
Thus, the induction hypothesis is true for j = k 
for this subcase. 

Subcase 1b: $p_k = 0$. The A register data is kept in 
the A register of PE P and not moved by L4 or 
L5. As in Subcase 1a, since $g_{k-1} = 0$, 
$(0 g_{k-2} \ldots g_1 g_0) * 2 < 2^k$. Thus, if $p_k = 0$, it must be 
that $g_k = 0$. Since $g_k = 0$, 

$P = (g_{m-1} \ldots g_{k+1} 0) * 2^k + (g_{k-1} \ldots g_1 g_0) * 2$ 
$= (g_{m-1} \ldots g_{k+1}) * 2^{k+1} + (g_k \ldots g_1 g_0) * 2$. 

Furthermore, the data is in the A register and 
$g_k = 0$. Thus, the induction hypothesis is true for 
j = k for this subcase. 

Case 2: The data item from the DTR of PE 
$G = g_{m-1} \ldots g_1 g_0$, where $g_{k-1} = 1$. From the induction 
hypothesis when j = k-1, this data item is in 
the DTR of PE 
$P = (g_{m-1} \ldots g_{k+1} g_k) * 2^k + (g_{k-1} \ldots g_1 g_0) * 2$. 

Subcase 2a: $p_k = 1$. The DTR register data is moved 
to the A register of PE P by L4 and is not moved 
by L5. Recall 
$p_{m-1} \ldots p_1 p_0 = (g_{m-1} \ldots g_{k+1} g_k) * 2^k + (g_{k-1} \ldots g_1 g_0) * 2$. 
Since $g_{k-1} = 1$, $(g_{k-1} \ldots g_1 g_0) * 2 = 2^k + (g_{k-2} \ldots g_1 g_0) * 2$. 
Thus, if $p_k = 1$, it must be that $g_k = 0$. Since $g_k = 0$, 

$P = (g_{m-1} \ldots g_{k+1} 0) * 2^k + (g_{k-1} \ldots g_1 g_0) * 2$ 
$= (g_{m-1} \ldots g_{k+1}) * 2^{k+1} + (g_k \ldots g_1 g_0) * 2$. 

Furthermore, the data is in the A register and 
$g_k = 0$. Thus, the induction hypothesis is true for 
j = k for this subcase. 

Subcase 2b: $p_k = 0$. The DTR register data is kept 
in the DTR register of PE P (not moved by L4). 
It is then moved to the DTR of PE $P + 2^k$ by 
L5. Since $g_{k-1} = 1$, 
$(g_{k-1} \ldots g_1 g_0) * 2 = 2^k + (g_{k-2} \ldots g_1 g_0) * 2$. 
Thus, if $p_k = 0$, it must be that $g_k = 1$. Since $g_k = 1$, 
$P + 2^k = (g_{m-1} \ldots g_{k+1}) * 2^{k+1} + (g_k \ldots g_1 g_0) * 2$, as in 
Subcase 1a. Furthermore, the data is in the DTR 
and $g_k = 1$. Thus the induction hypothesis is 
true for j = k for this subcase. 

Tab. 1: Example of the algorithm for performing the shuffle using the PM2I when N = 8. It is assumed that initially the DTR of PE P contains the integer P, 0 ≤ P < 8. 


5. Shuffling with the Illiac Network 


In this section the use of the Illiac network to perform 
the shuffle will be examined. First, it will be 
shown that a lower bound on the number of transfers 
(executions of Illiac interconnection functions) needed is 
3n/2. Then, an algorithm requiring 2n-1 transfers will 
be presented. 

To show that a lower bound on the number of 
transfers is 3n/2, four of the N data moves which the 
shuffle performs will be considered. These are: 

a) from PE (N/4 - n/4) to PE (N/2 - n/2) 

b) from PE (N/2 - n/2) to PE (N - n) 

c) from PE (N/2 + n/2 - 1) to PE (n - 1) 

d) from PE (3N/4 + n/4 - 1) to PE (N/2 + n/2 - 1) 


For N = 64 these correspond to: (a) 14 → 28, (b) 28 → 
56, (c) 35 → 7, and (d) 49 → 35. All four of these 
moves are done simultaneously when the shuffle 
interconnection function is executed. It will now be shown 
that the Illiac cannot do all four in fewer than 3n/2 
transfers, i.e., at least 3n/2 transfers are needed. To 
simplify the presentation, the N = 64 values will be used 
to demonstrate the bound. The result obtained in this 
way is directly generalizable by substituting (n - 1) for 7, 
(N/4 - n/4) for 14, (N/2 - n/2) for 28, (N/2 + n/2 - 1) 
for 35, (3N/4 + n/4 - 1) for 49, (N - n) for 56, Illiac$_{+n}$ 
for Illiac$_{+8}$, and Illiac$_{-n}$ for Illiac$_{-8}$. 
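
As a quick check of the four source/destination pairs, the sketch below evaluates the general expressions (a) through (d) and compares each destination with the shuffle of its source. It assumes N is a power of 4 with n = √N divisible by 4; the function names are illustrative, not from the paper.

```python
import math


def shuffle(p: int, n_pes: int) -> int:
    """Perfect shuffle on log2(n_pes)-bit addresses (rotate left by one bit)."""
    m = n_pes.bit_length() - 1
    return ((p << 1) | (p >> (m - 1))) & (n_pes - 1)


def critical_moves(n_pes: int) -> dict:
    """The four (source, destination) pairs (a)-(d) used in the lower bound."""
    n = math.isqrt(n_pes)
    return {
        "a": (n_pes // 4 - n // 4, n_pes // 2 - n // 2),
        "b": (n_pes // 2 - n // 2, n_pes - n),
        "c": (n_pes // 2 + n // 2 - 1, n - 1),
        "d": (3 * n_pes // 4 + n // 4 - 1, n_pes // 2 + n // 2 - 1),
    }


def moves_are_shuffle_moves(n_pes: int) -> bool:
    """True if every listed destination equals shuffle(source)."""
    return all(shuffle(src, n_pes) == dst
               for src, dst in critical_moves(n_pes).values())
```

For N = 64 this reproduces 14 → 28, 28 → 56, 35 → 7 and 49 → 35.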


In order to more easily visualize the data move- 
ments in the Illiac network the “wrap-around” connec- 
tions (e.g., 7 to 8, 56 to 0) have been “unwrapped” by 
drawing eight projections of the network, as shown in 
Fig. 7. The actual network is labeled ‘‘C” for center, 
and the eight projections are labeled NW (north west), N 
(north), NE (north east), W (west), E (east), SW (south 
west), S (south), and SE (south east). Thus, each PE is 
represented nine times: once in the original (center) net- 
work, and once in each projection. 

For example, consider the data movement from PE 
7 to PE 8 using the Illiac$_{+1}$ function. Normally, PE 7, 
which is in the rightmost column of the Illiac network, 
connects to PE 8, which is in the leftmost column, using 
a "wrap-around" connection. For purposes of this 
discussion, the data from PE 7 in C will be moved to C's 
PE 8 equivalent in the E projection. 

In order to draw the projections, two constraints 
must be satisfied. 

(1) Each projection has to be topologically isomorphic 
to the Illiac network. 

(2) Each projection must have the proper adjacency to 
the C network and the other projections. 

Proper adjacency means that two PEs, each from 
different projections, are drawn adjacent to one another 
if and only if they are connected in the original network. 
As an example of this, consider 7 in C, 63 in N, 0 in NE, 
and 8 in E. 

One could continue generating more of these projections 
"ad infinitum" to represent all possible implementations 
of all possible moves. However, the goal here is 
to show that the set of moves (a) through (d) above 
cannot be done in less than 3n/2 steps. Therefore, 
projections which would involve more than 3n/2 steps to do 
any of (a) through (d) individually are not of interest 
and are unnecessary. 

The lower bound proof is organized as follows. 
First it will be shown that there are only five sets of 
Illiac function executions that can perform both the 
28 → 56 and 35 → 7 moves in less than 3n/2 steps (Fig. 
7 and Tab. 2). Then it will be shown that there are only 
five sets of Illiac function executions (which happen to be 
different from the first five sets) that can perform both 
the 14 → 28 and 49 → 35 moves in less than 3n/2 steps 
(Fig. 8 and Tab. 3). Finally, it will be shown that no 
single set of less than 3n/2 Illiac function executions can 
perform all four moves (Tab. 4). 


Tab. 2: All possible combinations of 28 → 56 and 35 → 7 paths that can be done individually in less than 3n/2 steps. 

                                    28 → 56 
35 → 7           C (4,0,0,4)    E (3,4,0,0)    NE (0,4,5,0)   N (0,0,4,4) 
C  (0,4,4,0)     (4,4,4,4)      (3,4,4,0) ✓    (0,4,5,0) ✓    (0,4,4,4) 
W  (0,0,3,4)     (4,0,3,4) ✓    (3,4,3,4)      (0,4,5,4)      (0,0,4,4) ✓ 
SW (5,0,4,0)     (5,0,4,4)      (5,4,4,0)      (5,4,5,0)      (5,0,4,4) 
S  (4,4,0,0)     (4,4,0,4)      (4,4,0,0) ✓    (4,4,5,0)      (4,4,4,4) 




Tab. 3: All possible combinations of 14 → 28 and 49 → 35 paths that can be done individually in less than 3n/2 steps. 

                             14 → 28 
49 → 35          C (2,0,0,2)    N (0,0,6,2)    E (1,6,0,0) 
C  (0,2,2,0)     (2,2,2,2) ✓    (0,2,6,2) ✓    (1,6,2,0) ✓ 
W  (0,0,1,6)     (2,0,1,6) ✓    (0,0,6,6)      (1,6,1,6) 
S  (6,2,0,0)     (6,2,0,2) ✓    (6,2,6,2)      (6,6,0,0) 


Fig. 7 shows all the paths from the source PE 28 in 
the C network to its associated destination PE 56 in the 
C network and in the eight projections. Also shown is 
the source PE 35 in the C network and its associated 
destination PE 7 in the C network and in the eight 
projections. There are only four ways to go from 28 to 56 
in less than 3n/2 = 12 moves and these are shown at the 
top of Tab. 2. The four ways to go from 35 to 7 in less 
than 12 moves are shown on the side of Tab. 2. The 
four-tuple (w, x, y, z) means that the path consists of w 
Illiac$_{+8}$ executions (moves), x Illiac$_{+1}$ executions, y 
Illiac$_{-8}$ executions, and z Illiac$_{-1}$ executions. Note that for 
the purposes here the order of execution is irrelevant. 
For example, 28 in C can go to 56 in the NE projection 
by (0, 4, 5, 0), i.e., the path consists of four Illiac$_{+1}$ 
moves and five Illiac$_{-8}$ moves. Any path between 28 in 
C and 56 in NE must include these moves. This is true 
in general, i.e., if the path from PE A to PE B is given 
as (w, x, y, z) then (1) the moves specified by the 
four-tuple will send data from A to B, and (2) any path from 
A to B must include the moves specified by the 
four-tuple. In what follows {·} will denote the generalization 
of the path from n = 8 to any n. 
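
A four-tuple can be checked by simple modular arithmetic, since Illiac$_{\pm 1}$ and Illiac$_{\pm n}$ add ±1 and ±n to the PE address mod N. The helper below is a minimal sketch under that assumption; `follow_path` is an illustrative name, and only tuples quoted in the text are exercised.

```python
def follow_path(src: int, path: tuple, n_pes: int = 64, n: int = 8) -> int:
    """Destination reached from PE `src` by a four-tuple (w, x, y, z).

    w = Illiac_{+n} moves, x = Illiac_{+1} moves,
    y = Illiac_{-n} moves, z = Illiac_{-1} moves.
    The order of the moves does not matter because addition mod N commutes.
    """
    w, x, y, z = path
    return (src + n * w + x - n * y - z) % n_pes


# Paths for 28 -> 56 quoted in the text (C, E, NE and N projections).
paths_28_to_56 = [(4, 0, 0, 4), (3, 4, 0, 0), (0, 4, 5, 0), (0, 0, 4, 4)]
# Paths for 35 -> 7 used in the worked examples of the text.
paths_35_to_7 = [(0, 4, 4, 0), (0, 0, 3, 4), (4, 4, 0, 0)]
```

Each listed tuple for 28 → 56 evaluates to 56, and each listed tuple for 35 → 7 evaluates to 7.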

Each square in Tab. 2 shows the set of moves needed 
to do both the 28 → 56 and 35 → 7 moves for all 
possible combinations of the individual moves which can be 
done in less than 12 {3n/2} steps. The five combinations 
which can be done in less than 12 {3n/2} steps are 
marked by a check (✓). For example, the 28 → 56 path 
(4, 0, 0, 4) {(n/2, 0, 0, n/2)} and 35 → 7 path (0, 0, 3, 4) 
{(0, 0, (n/2)-1, n/2)} can be combined to require the 
moves (4, 0, 3, 4) {(n/2, 0, (n/2)-1, n/2)}. These two 
paths can both use the same four executions of Illiac$_{-1}$ 
and so these four moves are counted only once. Therefore, 
the total is 4 + 0 + 3 + 4 = 11 < 12 
{n/2 + 0 + (n/2)-1 + n/2 = (3n/2)-1 < 3n/2} 
moves. An example of a combination involving more 
than 12 {3n/2} moves is the 28 → 56 (4, 0, 0, 4) {(n/2, 
0, 0, n/2)} path and the 35 → 7 (0, 4, 4, 0) {(0, n/2, 
n/2, 0)} path. This combination requires a total 
of 4 + 4 + 4 + 4 = 16 > 12 {n/2 + n/2 + n/2 + 
n/2 = 2n > 3n/2} Illiac functions. 
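
The combination rule used in the two examples above amounts to taking, for each of the four Illiac functions, the larger of the two counts, since executions of the same function can be shared by both paths. A minimal sketch of that rule (the helper names are assumptions for illustration):

```python
def combine(path_a: tuple, path_b: tuple) -> tuple:
    """Moves needed to perform both paths: shared executions count once."""
    return tuple(max(a, b) for a, b in zip(path_a, path_b))


def cost(path: tuple) -> int:
    """Total number of Illiac function executions in a four-tuple."""
    return sum(path)


def general_example(n: int) -> tuple:
    """The {.} generalization of the first example for an even n."""
    return combine((n // 2, 0, 0, n // 2), (0, 0, n // 2 - 1, n // 2))
```

With n = 8, `combine((4, 0, 0, 4), (0, 0, 3, 4))` gives (4, 0, 3, 4) with cost 11, and `combine((4, 0, 0, 4), (0, 4, 4, 0))` gives (4, 4, 4, 4) with cost 16, as in the text.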

The analysis for the 14 — 28 and 49 — 35 transfers 
shown in Fig. 8 and Tab. 3 is similar. The five sets of Il- 
liac functions which can do both of these transfers in less 
than 12 moves are checked in Tab. 3. 

The final step to the proof is to examine all combinations 
of the five sets found in each of Tabs. 2 and 3 to 
see if there exists any set of transfers which can perform 
all four transfers (28 → 56, 35 → 7, 14 → 28, and 
49 → 35) in less than 12 moves. This is shown in Tab. 
4. As demonstrated, there is no such set. There are 
seven sets which require exactly 12 moves (indicated by 
checks), but none which requires less than 12. For 
example, 28 → 56 and 35 → 7 can be done using (3, 4, 4, 
0), and 14 → 28 and 49 → 35 can be done using (2, 2, 2, 
2); however, the combination of these two sets yields (3, 
4, 4, 2), which is 13 moves and therefore greater than 12. 

Tab. 4: Combination of relevant paths from Tabs. 2 and 3. 

In summary, four of the moves performed by shuffle 
(28 → 56, 35 → 7, 14 → 28, and 49 → 35) have been 
examined. It has been shown that no set of Illiac function 
executions can do this in less than 3n/2 = 12 moves. As 
indicated above, this argument can be generalized directly 
using the substitutions listed. 

Consider an algorithm for performing the shuffle 
interconnection function with the Illiac network. This will 
be done by replacing each PM2I interconnection function 
in the above algorithm with Illiac interconnection 
functions. For L2, use "Illiac$_{+1}$ [$X^{m-1}1$]," since 
Illiac$_{+1}$ = $\mathrm{PM2}_{+0}$. Similarly, for L6, use Illiac$_{+1}$ 
with the mask already specified in L6. To do L5, first recall 
that only the even numbered PEs contain the data of concern 
(after L2 is executed and before L6 is executed). Therefore, it is 
acceptable to use "$\mathrm{PM2}_{+j}$ [$X^m$]" in L5, since any data 
movement among the odd numbered PEs is ignored (and 
overwritten by L6). To perform "$\mathrm{PM2}_{+j}$ [$X^m$]," for 
1 ≤ j < m, with the Illiac network the algorithms 
presented in [31] can be used. Specifically, to perform 
"$\mathrm{PM2}_{+j}$ [$X^m$]" for 1 ≤ j < m/2 use: 

    for i = 1 until $2^j$ do Illiac$_{+1}$ [$X^m$] 

since $2^j$ executions of Illiac$_{+1}$ is equivalent to 
$+2^j = \mathrm{PM2}_{+j}$. Analogously, to perform "$\mathrm{PM2}_{+j}$ [$X^m$]" 
for m/2 ≤ j < m use: 

    for i = 1 until $2^j/n$ do Illiac$_{+n}$ [$X^m$] 

since $2^j/n$ executions of Illiac$_{+n}$ is equivalent to 
$+2^j = \mathrm{PM2}_{+j}$. The total number of Illiac transfers 
needed is: 

    for L2: 1 
    for L6: 1 
    for L5, 1 ≤ j < m/2: $\sum_{j=1}^{(m/2)-1} 2^j = 2^{m/2} - 2 = n - 2$ 
    for L5, m/2 ≤ j < m: $\sum_{j=m/2}^{m-1} 2^j/n = \sum_{j=0}^{(m/2)-1} 2^j = n - 1$ 



Thus, the grand total is 2n—1 transfers. As mentioned in 
Section 1, this compares favorably with the earlier result 
of 4n—4 in [25]. 
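
The transfer count can be tallied directly from the four contributions listed above. This is a small sketch assuming m is even so that n = $2^{m/2}$ is an integer; the function name is illustrative.

```python
def illiac_shuffle_transfers(m: int) -> int:
    """Count Illiac transfers used to emulate the PM2I shuffle algorithm."""
    if m < 2 or m % 2:
        raise ValueError("m must be a positive even integer")
    n = 1 << (m // 2)
    l2 = 1                                        # one Illiac_{+1} for L2
    l6 = 1                                        # one Illiac_{+1} for L6
    # L5 for 1 <= j < m/2: PM2_{+j} costs 2^j executions of Illiac_{+1}.
    l5_low = sum(1 << j for j in range(1, m // 2))
    # L5 for m/2 <= j < m: PM2_{+j} costs 2^j / n executions of Illiac_{+n}.
    l5_high = sum((1 << j) // n for j in range(m // 2, m))
    return l2 + l6 + l5_low + l5_high
```

For N = 64 (m = 6, n = 8) this gives 15 = 2n - 1 transfers, which lies between the lower bound 3n/2 = 12 and the earlier result 4n - 4 = 28.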


6. Conclusions 


The ability of the PM2I and Illiac single stage SIMD 
machine interconnection networks to perform the 
shuffle interconnection was examined. In [28] it was shown 
that a lower bound on the number of transfers needed 
for the PM2I network to perform the shuffle is $\log_2 N$. 
The algorithm described here and proven correct 
requires only $(\log_2 N) + 1$ transfers. This algorithm was 
used as the basis for an algorithm to do the shuffle with the 
Illiac network in $(2\sqrt{N}) - 1$ transfers. This compares 
favorably with an earlier result of $4(\sqrt{N} - 1)$ in [25]. A lower 
bound analysis was presented showing that at least 
$3\sqrt{N}/2$ transfers are required for this task. 

These results are of both theoretical and practical 
value. Theoretically, they add to the body of knowledge 
about the properties of the PM2I and Illiac networks. 
Practically, the algorithms presented could actually be 
used to perform the shuffle interconnection on a system 
that has implemented the PM2I or Illiac network. 
Furthermore, the lower bound proof shows that it is 
impossible to do the shuffle with the Illiac in any fewer 
than $3\sqrt{N}/2$ steps. 


— Acknowledgements: Some of the figures and tables in this 
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A CLASSIFICATION OF CUBE-CONNECTED NETWORKS 
WITH A SIMPLE CONTROL SCHEME 


A. Yavuz Oruc 
Electrical, Computer and Systems Engineering Department 
Rensselaer Polytechnic Institute 
Troy, New York 12181 


Abstract -- The paper presents three classes 
of cube-connected networks with individual stage 
control based on a group theoretic representation 
of interconnection networks. It is shown that 
these classes of networks have non-isomorphic 
group properties. Although permutations realizable 
by such networks are rather limited in number, 
the simplicity of their control scheme makes them 
attractive for VLSI implementation. Moreover, the 
interconnection power of these networks can be en- 
hanced by simulating the networks which belong to 
one class by any member of that class. Thus it 
becomes important to derive conditions under which 
these networks become equivalent. The paper pro- 
vides a characterization for each class of networks 
from which isomorphism maps can easily be obtained. 
Methods are also presented to construct networks 
belonging to each class. 


I. Introduction 


Interconnection networks for multi-processor 
computers have drawn much attention from re- 
searchers in recent years [1,2,3]. Much of the 
previous work concentrated on proposing networks 
to implement a specific class of data routing al- 
gorithms in a parallel processing environment. Ex- 
amples of such networks can be found in [4,5,6,7, 
8,9,10,11,12,13]. The availability of several net- 
works for similar tasks prompted further studies to 
determine the interconnection power and capability 
of certain networks to simulate various other net- 
works. These fall into two categories. The first 
category deals with proving asymptotic bounds on 
the number of distinct interconnections and steps 
(passes) to realize them by a given network [14,15, 
16]. Such bounds describe the worst case behavior 
of a network and hence form a useful basis for its 
evaluation. However, they do not reveal any in- 
formation as to what class of interconnections a 
network can realize. To provide a comparative 
evaluation of interconnection networks, others in- 
vestigated the capabilities of certain networks to 
simulate various other networks [11,17,18,19,20, 
21]. These led to a class of multistage interconnection 
networks, here called the edge-wise cube-connected 
networks, simply because their interconnection 
functions can be associated with the edges 
of a cube. We demonstrate that under individual 
stage control, edge-wise cube-connected networks 
manifest themselves by pairwise commuting permutations 
which represent their stages. This observation 
essentially paves the way for obtaining two 
new classes of cube-connected networks with the 
same control scheme. We provide characterizations 
of these classes of networks in terms of the cycle 
structures of their permutations. These indicate 
that an equivalence among two cube-connected networks 
can be attributed to (1) the existence of a 
group isomorphism among the permutations of two 
networks and/or (2) a one-to-one onto correspondence 
between the permutations representing the 
stages of the two networks. The paper is organized 
as follows. In section II we introduce a 
network model which describes the behavior of 
interconnection networks under arbitrary control 
schemes. In section III cube-connected networks 
are described using this model. In section IV the 
group properties of cube-connected networks with 
individual stage control are explored. It 
is shown that there exist at least three classes 
of such networks with non-isomorphic group properties. 
In section V the characterization of each 
class of networks is provided in terms of the cycle 
structures of their permutations. In section VI 
an example is given for each class of networks. 


II. Network Model and Definitions 


In this section we introduce an interconnection 
network model and basic definitions about 
interconnection networks. These will serve as a 
basis for the analysis and presentation of 
cube-connected networks. To this end, we define an 
interconnection network IN as the five tuple 
(S,D,M,F,g) where: 


(a) S and D are finite sets whose elements are 
called the source nodes and destination nodes 
respectively, 

(b) M is a finite set whose elements are 
called the control inputs, 

(c) F is a set of mappings $f_i$, called the 
interconnection functions, such that $f_i : S_i \to D_i$ where 
$S_i$ and $D_i$ belong to some partitions of S and D 
respectively, 

(d) g, called the control function, is a 
surjection from F to M. 


For convenience, the elements of S(D) are 
assumed to be integers modulo |S|(|D|) where |S|(|D|) 
is the cardinality of S(D). 

IN is called a permutation network if $f_i$ is a 
bijection for all $f_i \in F$ and $S_i = D_i$ for all $S_i \in S$ and 
$D_i \in D$. All the networks described in this paper are 
permutation networks unless otherwise stated. The 
cycle notation will be used to represent $f_i$ of a 
permutation network. Thus we shall write 
$f_i = (s_1 s_2 \ldots s_r)$ to imply $d_1 = (s_1)f_i = s_2$, $d_2 = (s_2)f_i = s_3$, ..., 
$d_r = (s_r)f_i = s_1$, where $s_k \in S_i$ and $d_k \in D_i$. A cycle of 
r elements is called an r-cycle; in particular, a cycle of 
2 elements is called a transposition. The set of all 
permutations on n symbols forms the symmetric 
group $S_n$. A group G is a set with a binary operation 
'$\circ$' on G, where '$\circ$' is associative, there is 
an identity element e in G such that $e \circ p = p \circ e = p$ for 
all $p \in G$ and for each $p \in G$, there is an inverse element, 
denoted $p^{-1}$, in G with the property that 
$p^{-1} \circ p = p \circ p^{-1} = e$. The group generated by permutations 
$p_1, \ldots, p_k$ will be denoted by $\langle p_1, \ldots, p_k \rangle$. G is 
called commutative if $p \circ q = q \circ p$ for all $p, q \in G$ and it 
is called cyclic if all of its elements can be 
generated by one element in G. 

A network is said to generate a subgroup of 
$S_n$ in k passes if the permutations realized by k 
consecutive applications of its interconnection 
functions over its source nodes form a subgroup 
of $S_n$. Since every finite group is a closed set, 
once a network generates a group, it exhausts all 
the interconnections it can realize. It follows 
that groups play important roles in determining 
the interconnection power of permutation networks. 

Control inputs are integer-valued variables. 
Each $f_i$ is further refined by the values of the 
control input with which it is associated by g. 
Thus if $m = (f_i)g$ and $m \in \{0,1\}$ then $f_i(m=0)$ and 
$f_i(m=1)$ denote the two cycles that $f_i$ will designate 
when m=0 and m=1 respectively. The map g 
coordinates the composition of $f_i$. It partitions F 
into disjoint subsets and assigns to each subset a 
unique control input. Thus if for IN=(S,D,M,F,g), 
$F = \{f_1, f_2, f_3\}$, $M = \{m_1, m_2\}$, $m_1, m_2 \in \{0,1\}$, 
$(f_1)g = (f_2)g = m_1$ and $(f_3)g = m_2$ then IN realizes the composition 
$f_1(m_1=0) \circ f_2(m_1=0) \circ f_3(m_2=1)$ when $m_1 = 0$ and $m_2 = 1$. As 
an example, let $f_1(m_1=0) = (021)$, $f_2(m_1=0) = (354)$ and 
$f_3(m_2=1) = (67)$. The network with $F = \{f_1, f_2, f_3\}$ and 
the required setting to realize 
$f_1(m_1=0) \circ f_2(m_1=0) \circ f_3(m_2=1)$ is depicted in Fig. 1. 

Fig. 1. Network IN Realizing f=(021)(354)(67). 

Note that the order of the composition is not 
critical since $f_i$ are pairwise disjoint cycles. 
Also note that the number of interconnections IN 
can realize is the number of values that $m_1$ can 
take times the number of values $m_2$ can take, which 
is 4. In general if the number of values which $m_i$ 
can take is $n_i$, then the number of interconnections 
realizable by IN is given by $\prod_{i=1}^{|M|} n_i$. 

Interconnection networks can be cascaded together 
to form multistage networks. An n-input, 
r-stage network, hereafter called an (n,r)-network, 
is a cascade of r networks $IN_i = (S_i, D_i, M_i, F_i, g_i)$ 
such that $S_i$ and $D_i$ are sets of integers modulo n 
and every interconnection function f of the network 
is defined as the composition $f = f_1 \ldots f_r$, that 
is, $(s_1)f = (s_1)f_1 \ldots f_r$, where $(s_1)f_1 = d_1 = s_2$, ..., 
$(s_r)f_r = d_r$, $s_i \in S_i$, $d_i \in D_i$ and $f_i$ is a composition of 
the interconnection functions of $IN_i$, for all i; 
1 ≤ i ≤ r. As an example, let IN be a (5,2) network 
where $F_1 = \{f_{11}, f_{12}\}$ and $F_2 = \{f_{21}, f_{22}\}$. IN is shown in 
Fig. 2. The control inputs are not specified 
although the rule described earlier applies to each 
stage. Hence each permutation of IN must be of 
the form $f = (f_{11} \circ f_{12}) \circ (f_{21} \circ f_{22})$. 

Fig. 2. A (5,2) Interconnection Network. 

We note that our definition of a multistage 
interconnection network differs from that of Benes 
[22] and many others in that the links between the 
stages are not included in our model. This is due 
to the fact that we assign the same label to a 
source and destination node if and only if there 
is a path between them when all interconnection 
nodes involving them are set to the identity 
connections. Thus we do not need to use interconnection 
functions to describe the links explicitly. 

It is of interest to compute the number N of 
distinct control schemes for an (n,r)-network. By 
counting the number of surjections from $F_i$ to $M_i$ 
[23], it can be shown that 

$$N = \sum_{i=1}^{r} \frac{1}{|M_i|!} \sum_{k=0}^{|M_i|} (-1)^{|M_i|-k} \binom{|M_i|}{k} k^{|F_i|}.$$ 

In particular, if $|M_i| = m$ and $|F_i| = p$ for all i; 1 ≤ i ≤ r, 
then 

$$N = \frac{r}{m!} \sum_{k=0}^{m} (-1)^{m-k} \binom{m}{k} k^{p}$$ 

or $N = r \cdot S(p,m)$ where S(p,m) denotes the Stirling 
number of the second kind [23]. 


III. Cube-Connected Networks 


In this section, interconnection networks 
with cube topology are presented. These constitute 
an important class of networks since their 
interconnection functions are transpositions which 
can easily be implemented by 2-input/output crossbar 
switches. Various such networks can be found 
in [5,6,9,11,12,13,19]. In what follows, we derive 
the conditions to generate a subgroup of $S_n$ 
through these networks in one pass under individual 
stage control (ISC). 

Definition 1. An (n,r) network with stages 
$IN_i = (S_i, D_i, M_i, F_i, g_i)$ is called cube-connected if 
$n = 2^k$ for some positive integer k and 
$F_i = \{f_{ij} : j = 1, \ldots, n/2\}$, where $f_{ij}$ is either the 
identity function or the transposition function 
from 2-subset $S_{ij}$ of $S_i$ to 2-subset $D_{ij}$ of $D_i$ for 
all i; 1 ≤ i ≤ r. 
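
The cycle notation and the right-action convention (s)f used in Section II can be made concrete with a short sketch. The representation of a permutation as a tuple indexed by source node and the helper names are assumptions made for illustration; the cycles are those of the Fig. 1 example.

```python
from itertools import permutations


def perm_from_cycles(cycles, n):
    """Permutation on {0..n-1} as a tuple: perm[s] is the image (s)f."""
    perm = list(range(n))
    for cycle in cycles:
        for pos, s in enumerate(cycle):
            perm[s] = cycle[(pos + 1) % len(cycle)]
    return tuple(perm)


def compose(p, q):
    """Apply p first, then q, matching (s)(p o q) = ((s)p)q."""
    return tuple(q[p[s]] for s in range(len(p)))


# The three interconnection functions of the Fig. 1 example.
f1 = perm_from_cycles([(0, 2, 1)], 8)
f2 = perm_from_cycles([(3, 5, 4)], 8)
f3 = perm_from_cycles([(6, 7)], 8)

# Disjoint cycles commute, so every ordering yields the same composition.
results = {compose(compose(a, b), c) for a, b, c in permutations((f1, f2, f3))}
```

Here `results` contains a single permutation, equal to the one obtained by reading f = (021)(354)(67) directly.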


As an example, a (4,2) cube-connected network 
can be specified by the partitions $P(S_1) = \{\{0,1\},\{2,3\}\}$ 
and $P(S_2) = \{\{0,3\},\{1,2\}\}$. The elements of 
$P(S_i)$ can be viewed as pairing the nodes of the 
2-cube labelled by integers modulo 4. The same 
analogy holds between an (n,r) cube-connected network 
and the k-cube where $k = \log_2 n$. In particular 
if the end nodes of the edges of the k-cube are 
labelled by integers whose binary expansions differ 
in exactly one place, then the elements of 
$P(S_1)$ [$P(S_2)$] correspond to pairing the end nodes 
of the edges (diagonals) of the 2-cube. 
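
The edge/diagonal distinction in this (4,2) example can be checked with the Hamming distance between node labels. The sketch below is illustrative only; `classify_pairs` is an assumed helper name, and the "diagonal" label is meaningful for the 2-cube of this example.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the labels a and b differ."""
    return bin(a ^ b).count("1")


def classify_pairs(partition):
    """Label each 2-subset as a cube edge (distance 1) or a diagonal."""
    labels = {}
    for a, b in partition:
        kind = "edge" if hamming_distance(a, b) == 1 else "diagonal"
        labels[(min(a, b), max(a, b))] = kind
    return labels


stage1 = classify_pairs([(0, 1), (2, 3)])   # P(S_1)
stage2 = classify_pairs([(0, 3), (1, 2)])   # P(S_2)
```

With the labelling of the text, every pair of P(S_1) is an edge and every pair of P(S_2) is a diagonal of the 2-cube.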


IV. Cube-Connected Networks With ISC 


In this section we present the group properties 
of cube-connected networks with ISC. An (n,r) 
network is said to be using ISC if $|M_i| = 1$ for all 
i; 1 ≤ i ≤ r. As a rule, such networks can not generate 
large subsets of $S_n$ since the stage $IN_i$ of 
these networks can realize only two permutations, 
namely, $f_i(m_i=0) = e_i$ and $f_i(m_i=1) = c_i$, where 
$M_i = \{m_i\}$. Thus the set of permutations generated 
by an (n,r) cube-connected network with ISC in one 
pass is given by the product $\langle c_1 \rangle \ldots \langle c_r \rangle$ where 
$\langle c_i \rangle \circ \langle c_j \rangle = \{x \circ y : x \in \langle c_i \rangle, y \in \langle c_j \rangle\}$. Clearly, the number 
of permutations generated this way can not exceed 
$2^r$. Despite this shortcoming, ISC is a very 
simple control scheme and hence networks with ISC 
are appealing in at least realizing subgroups of 
small order of $S_n$. We now state some conditions 
for generating such subgroups by these networks. 


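Theorem 1. Let IN be an (n,r) cube-connected 
network with ISC. IN generates a subgroup of order 
$2^r$ of $S_n$ in one pass if $c_i \circ c_j = c_j \circ c_i$ for all i,j; 
1 ≤ i < j ≤ r and $c_{i+1} \notin \langle c_1 \rangle \ldots \langle c_i \rangle$ for all i; 1 ≤ i < r. 


Proof. The set of permutations G generated by 
IN is the product $G = \langle c_1 \rangle \ldots \langle c_r \rangle$. Let $x, y \in G$, where 
$x = x_1 \ldots x_r$, $y = y_1 \ldots y_r$ and $x_i, y_i \in \langle c_i \rangle$. Then 
$x \circ y = (x_1 \ldots x_r) \circ (y_1 \ldots y_r)$. Since $c_i \circ c_j = c_j \circ c_i$, 
$x_i \circ y_j = y_j \circ x_i$ for all i,j; 1 ≤ i,j ≤ r, and hence 
$x \circ y = (x_1 \circ y_1) \ldots (x_r \circ y_r)$. But $x_i \circ y_i \in \langle c_i \rangle$. Thus 
$(x \circ y) \in G$, and G is closed. Also for all $x \in G$, 
$x^{-1} = x_1^{-1} \ldots x_r^{-1} \in G$, and 
$e = e_1 \ldots e_r \in G$. Thus G is a group. To complete the 
proof, we need to show $|G| = 2^r$. First, $|G| \le 2^r$ since 
G is a product of r 2-sets. Finally, the fact that $|G| = 2^r$ 
follows inductively from the observation that 
$G_i = \langle c_1 \rangle \ldots \langle c_i \rangle$, 1 ≤ i < r, is a group of order $2^i$ and 
for each $x \in G_i$, $x \circ c_{i+1} \notin G_i$, since if $x \circ c_{i+1} = y \in G_i$ then 
$c_{i+1} = x^{-1} \circ y \in G_i$, which contradicts the hypothesis. 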

Theorem 1 provides a sufficient condition for 
the realizability of a subgroup of $S_n$ in one pass 
by an (n,r) cube-connected network with ISC. However, 
the commutativity condition among the $c_i$ of 
an (n,r) cube-connected network is a strong condition 
to realize a subgroup of $S_n$ by the network. 
As such it always leads to commutative subgroups. 
It is noted that all edge-wise cube-connected networks 
which have appeared in [5,6,11,12,19] generate 
isomorphic copies of the commutative group of 
order $2^r$. The following theorem is based on a 
weaker condition and it indicates the presence of 
non-commutative subgroups realizable by an (n,r) 
cube-connected network. 

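Theorem 2. Let IN be an (n,r) cube-connected 
network with ISC. IN generates a group G in one 
pass if for every index pair i,j, where i<j, there 
exists k<j such that $c_i = c_j \circ c_k \circ c_j$. 

Proof. (By induction). 

Basis. r=2. In this case, $G = \langle c_1 \rangle \circ \langle c_2 \rangle$. 
Clearly, $e^{-1} = e$, $c_1^{-1} = c_1$ and $c_2^{-1} = c_2$ are in G. To prove that 
$(c_1 \circ c_2)^{-1} = c_2 \circ c_1 \in G$, let i=1 and j=2. Then by the 
hypothesis, there exists k=1<j such that 
$c_1 = c_2 \circ c_1 \circ c_2$ or $c_1 \circ c_2 = c_2 \circ c_1$. Thus 
$(c_1 \circ c_2)^{-1} = c_2 \circ c_1 = c_1 \circ c_2 \in G$ and G is a group. 

Induction Step. r=i>2. Suppose 
$G_i = \langle c_1 \rangle \ldots \langle c_i \rangle$ is a group. We wish to show that 
$G_{i+1} = \langle c_1 \rangle \ldots \langle c_{i+1} \rangle$ is also a group. Thus we must prove 
that $(c_{i+1} \circ x) \in G_{i+1}$ for each $x \in G_i$. Let $x = x_1 \ldots x_i$ 
where $x_k \in \langle c_k \rangle$. Then $c_{i+1} \circ x = c_{i+1} \circ (x_1 \ldots x_i)$. Now 
since i+1>1, by hypothesis there exists a $k_1 < i+1$ 
such that $x_1 = c_{i+1} \circ x_{k_1} \circ c_{i+1}$ or $c_{i+1} \circ x_1 = x_{k_1} \circ c_{i+1}$. Thus 
$c_{i+1} \circ x = x_{k_1} \circ c_{i+1} \circ x_2 \ldots x_i$. After repeating the same 
argument i-1 more times we obtain 
$c_{i+1} \circ x = x_{k_1} \circ x_{k_2} \ldots x_{k_i} \circ c_{i+1}$. 
Now since $G_i$ is a group, 
$y = x_{k_1} \ldots x_{k_i} \in G_i$ and hence $c_{i+1} \circ x = y \circ c_{i+1} \in G_{i+1}$. The 
assertion follows by the principle of induction. 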
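
To illustrate Theorem 2, the sketch below uses a hypothetical (8,3) example that is not taken from the paper: $c_1 = (01)(23)(45)(67)$ and $c_2 = (02)(13)(46)(57)$ commute, and $c_3 = (04)(16)(25)(37)$ is chosen here so that conjugation by $c_3$ exchanges $c_1$ and $c_2$. The helper names are assumptions as well.

```python
from itertools import product


def perm_from_transpositions(pairs, n):
    """Permutation tuple built from pairwise disjoint transpositions."""
    perm = list(range(n))
    for a, b in pairs:
        perm[a], perm[b] = b, a
    return tuple(perm)


def compose(p, q):
    """Apply p first, then q."""
    return tuple(q[p[s]] for s in range(len(p)))


def one_pass_set(stages):
    """All products x_1 ... x_r with x_i in {e, c_i}."""
    identity = tuple(range(len(stages[0])))
    realized = set()
    for choice in product((False, True), repeat=len(stages)):
        perm = identity
        for use_stage, c in zip(choice, stages):
            if use_stage:
                perm = compose(perm, c)
        realized.add(perm)
    return realized


def satisfies_theorem2(stages):
    """For every i < j there is some k < j with c_i = c_j o c_k o c_j."""
    for j in range(1, len(stages)):
        for i in range(j):
            if not any(stages[i] == compose(compose(stages[j], stages[k]), stages[j])
                       for k in range(j)):
                return False
    return True


# Hypothetical stages chosen for this illustration (not the paper's IN_2).
c1 = perm_from_transpositions([(0, 1), (2, 3), (4, 5), (6, 7)], 8)
c2 = perm_from_transpositions([(0, 2), (1, 3), (4, 6), (5, 7)], 8)
c3 = perm_from_transpositions([(0, 4), (1, 6), (2, 5), (3, 7)], 8)
G = one_pass_set([c1, c2, c3])
closed = all(compose(x, y) in G for x in G for y in G)
```

Under these assumptions the one-pass set has 8 elements, is closed under composition, and is non-commutative because $c_1 \circ c_3 \ne c_3 \circ c_1$.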


A similar result can be stated as follows. 


Corollary 1. Let IN be an (n,r) cube-connected 
network with ISC. IN generates a group in one 
pass if for each i>j there exists k>j such that 
$c_i = c_j \circ c_k \circ c_j$. 


Proof. The proof easily follows from Theorem 2. 


The previous results amount to providing conditions 
for realizing subgroups of $S_n$ in one pass 
by an (n,r) cube-connected network with ISC. It is 
also important to find conditions for which such 
networks fail to realize a group since such conditions 
will lead to the construction of cube-connected 
networks with multi-pass features. The next result 
establishes such a condition. It is still a 
conjecture for which a proof is under development. 


Conjecture 1. Let IN be an (n,r) cube-connected 
network with ISC and $c_i$ such that $c_i \circ c_j \ne c_j \circ c_i$ 
for all i,j; $1 \le i \ne j \le r$. IN can not realize a 
subgroup of $S_n$ in one pass. 


The previous results assert that there exist 
at least three classes of cube-connected networks 
with non-isomorphic group properties: 


(a) Networks which generate commutative subgroups 
of $S_n$ of order $2^r$ in one pass, 

(b) Networks which generate non-commutative 
subgroups of $S_n$ of order $2^r$ in one pass, 

(c) Networks which fail to realize any subgroup 
of $S_n$ in one pass. 


In the following section the characterization 
of each class of networks is obtained. 


V. Characterization of Network Classes 


In this section we give two results to identify 
the $c_i$ of a network belonging to one of the 
three classes. 


Theorem 3. Let $c_i$ and $c_j$ be products of n/2 
pairwise disjoint transpositions with no transposition in common, where 
$n = 2^m$ for some positive integer m. Then $c_i \circ c_j = c_j \circ c_i$ 
iff for some partition of the transpositions of $c_i$ 
into pairs $(a_s b_s)(c_s d_s)$ there exists a partition 
of the transpositions of $c_j$ into pairs $(a_s c_s)(b_s d_s)$. 


Proof. Suppose $c_i = p_1 \ldots p_{n/4}$ and $c_j = q_1 \ldots q_{n/4}$, 
where $p_s = (a_s b_s)(c_s d_s)$ and $q_s = (a_s c_s)(b_s d_s)$. Then 
$p_s \circ q_t = q_t \circ p_s$ since $p_s$ and $q_t$ are permutations of 
disjoint sets of nodes whenever $s \ne t$. Thus 
$c_i \circ c_j = (p_1 \circ q_1) \ldots (p_{n/4} \circ q_{n/4})$. But $p_s \circ q_s = q_s \circ p_s$ and 
$c_i \circ c_j = (q_1 \circ p_1) \ldots (q_{n/4} \circ p_{n/4})$ or $c_i \circ c_j = c_j \circ c_i$. 
Conversely, let (ab) be a transposition of $c_i$ and 
(bc) of $c_j$. Then $(a)c_i \circ c_j = c$. Now suppose (cd) is 
a transposition of $c_i$ and (ad) is not of $c_j$. Then 
$(a)c_j \circ c_i \ne c$. Hence $c_i \circ c_j \ne c_j \circ c_i$ and the assertion 
follows by contradiction. 
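
The pairing in Theorem 3 can be searched for mechanically: $c_j$ must carry each transposition of $c_i$ onto a different transposition of $c_i$. The sketch below is one possible way to test this and is checked against the stage permutations $c_1$, $c_2$, $c_3$ listed for $IN_1$ in Section VI; `theorem3_pairing` and the non-commuting stage are hypothetical illustrations.

```python
def perm_from_transpositions(pairs, n):
    """Permutation tuple built from pairwise disjoint transpositions."""
    perm = list(range(n))
    for a, b in pairs:
        perm[a], perm[b] = b, a
    return tuple(perm)


def compose(p, q):
    """Apply p first, then q."""
    return tuple(q[p[s]] for s in range(len(p)))


def theorem3_pairing(ci_pairs, cj_pairs, n):
    """Pairs ((a,b),(c,d)) of c_i matched by c_j, or None if none exists."""
    cj = perm_from_transpositions(cj_pairs, n)
    ci_set = {frozenset(t) for t in ci_pairs}
    used, pairing = set(), []
    for a, b in ci_pairs:
        t = frozenset((a, b))
        if t in used:
            continue
        image = frozenset((cj[a], cj[b]))      # where c_j sends the pair {a, b}
        if image == t or image not in ci_set:
            return None
        used.update((t, image))
        pairing.append(((a, b), tuple(sorted(image))))
    return pairing


C1 = [(0, 1), (2, 3), (4, 5), (6, 7)]
C2 = [(0, 2), (1, 3), (4, 6), (5, 7)]
C3 = [(0, 4), (1, 5), (2, 6), (3, 7)]
HYPOTHETICAL = [(0, 2), (1, 4), (3, 6), (5, 7)]   # not from the paper
```

For $c_1$ and $c_2$ the search returns the pairs (01)(23) and (45)(67), and for $c_1$ and $c_3$ it returns (01)(45) and (23)(67), in agreement with the decompositions given for $IN_1$.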


Theorem 4. Let $c_i$, $c_j$ and $c_k$ be products of 
n/2 pairwise disjoint transpositions where $n = 2^m$ 
for some positive integer m ≥ 3, and suppose that 
there is no transposition common to all of $c_i$, $c_j$ 
and $c_k$. Then $c_i = c_j \circ c_k \circ c_j$ iff when 

$c_i = \prod_{s=1}^{2^{m-3}} (a_s b_s)(c_s d_s)(e_s f_s)(g_s h_s)$ and 

$c_j = \prod_{s=1}^{2^{m-3}} (a_s c_s)(b_s f_s)(d_s g_s)(e_s h_s)$, $c_k$ has the form 

$c_k = \prod_{s=1}^{2^{m-3}} (a_s g_s)(b_s h_s)(c_s f_s)(d_s e_s)$. 

Proof. The condition given for $c_i = c_j \circ c_k \circ c_j$ 
is obviously sufficient. To prove that it is also 
necessary, suppose (xy) is a transposition of $c_i$, 
such that $(x)c_j = w$ and $(y)c_j = z$. Now suppose 
$(w)c_k = z' \ne z$. Then $(x)c_i = (x)c_j \circ c_k \circ c_j = (w)c_k \circ c_j = 
(z')c_j \ne y$. But this contradicts the hypothesis that 
(xy) is a transposition of $c_i$. Therefore the 
assertion must be true. 
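
The sufficiency direction of Theorem 4 can be confirmed by direct computation. The sketch below assumes the labelling a, b, ..., h = 8s, 8s+1, ..., 8s+7 for block s, which is an arbitrary choice made here for illustration.

```python
def perm_from_transpositions(pairs, n):
    """Permutation tuple built from pairwise disjoint transpositions."""
    perm = list(range(n))
    for a, b in pairs:
        perm[a], perm[b] = b, a
    return tuple(perm)


def compose(p, q):
    """Apply p first, then q."""
    return tuple(q[p[s]] for s in range(len(p)))


def theorem4_triple(m):
    """Build c_i, c_j, c_k of the stated form on n = 2^m nodes (m >= 3)."""
    n = 1 << m
    ci, cj, ck = [], [], []
    for s in range(n // 8):                      # 2^(m-3) blocks of 8 nodes
        a, b, c, d, e, f, g, h = range(8 * s, 8 * s + 8)
        ci += [(a, b), (c, d), (e, f), (g, h)]
        cj += [(a, c), (b, f), (d, g), (e, h)]
        ck += [(a, g), (b, h), (c, f), (d, e)]
    return tuple(perm_from_transpositions(t, n) for t in (ci, cj, ck))


def is_conjugate_form(ci, cj, ck):
    """Check c_i = c_j o c_k o c_j."""
    return ci == compose(compose(cj, ck), cj)
```

For m = 3 this gives $c_i = (01)(23)(45)(67)$, $c_j = (02)(15)(36)(47)$ and $c_k = (06)(17)(25)(34)$, which satisfy $c_i = c_j \circ c_k \circ c_j$; changing $c_k$ breaks the identity.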


VI. Examples 


In this section we use the characterizations 
given in the earlier section to construct three 
CCN's with ISC one from each class. Theorem 3 can 
be used to construct the networks asserted in 
Theorems 1 and 3, while Theorem 4 can be used to 
construct the networks ees in Theorem 2. 

Three (8,3)-networks, IN,, and IN, are de- 
picted in Figs. 3, 4 and 5. Rg construct IN,, its 
c, are so chosen that they satisfy the hypothesis 
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of Theorem 3, that is, c,= 
p,=(01) (23), 
q2=(46) (57). 


Pi°P2> Co=4,°d,, where 
P2=(45) (67) and q,=(02) (13), 
Similarly, ci=pi°p2 and c3=u);,°u2, 
where pi=(01)(45), p2=(23) (67) and u,=(04)(15), 
u,=(26) (37). Also, c,=p,*Py, C,=U,*u,, where 
p,=(02) (46), py=(13)(57) and u,=(04) (26), 
u,=(15) (37). Thus by Theorem 3, c,, cy and .c, 
pairwise commute and IN, generates a commutative 
subgroup of order 8, as shown in Fig. 3. On the 
other hand, IN. is constructed such that no two of 
its c, commute and hence it fails to generate a 
group as indicated by - entries in Fig. 5. Finally, 
the c, network IN, are chosen according to the hy- 
ds 
pothesis of Theorem 4 so that c,=cyc,c,y and 
p=C3°C, °C, OF C,=C,°C,°Cc,. Thus the hypothesis 
of Theorem 2 is satisfied and $IN_2$ generates the 
non-commutative group of order 8, which is shown 
in Fig. 4. 


Note that Theorems 3 and 4 can also be used 
to test if two networks belong to the same class. 
If the networks are specified topologically as in 
Figs. 3, 4 or 5, first construct the permutations 
corresponding to the stages of each network. Then 
test if the permutations of both networks have the 
same characterization in the sense of either of 
Theorems 3 and 4. For example, if both have pair- 
wise commuting permutations, it is clear that both 
belong to the class of commutative cube-connected 
networks and hence they are equivalent. The ex- 
plicit construction of equivalence maps will be 
deferred to another place. 
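
The commutativity test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three stage permutations are the ones read from the example in this section, the helper names (perm_from_cycles, compose, pairwise_commute, generated_group) are hypothetical, and "the group generated" is taken to mean the closure of the stage permutations under composition.

```python
from itertools import combinations

def perm_from_cycles(cycles, n=8):
    """Build a permutation of 0..n-1 (as a tuple) from disjoint cycles."""
    p = list(range(n))
    for cyc in cycles:
        for i, x in enumerate(cyc):
            p[x] = cyc[(i + 1) % len(cyc)]
    return tuple(p)

def compose(p, q):
    """Apply p first, then q (postfix convention, as in (x)c)."""
    return tuple(q[p[x]] for x in range(len(p)))

def pairwise_commute(perms):
    """True if every pair of permutations commutes."""
    return all(compose(p, q) == compose(q, p)
               for p, q in combinations(perms, 2))

def generated_group(perms):
    """Closure of the given permutations under composition."""
    identity = tuple(range(len(perms[0])))
    group, frontier = {identity}, [identity]
    while frontier:
        g = frontier.pop()
        for p in perms:
            h = compose(g, p)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group

# Stage permutations of the commutative example (assumed reading).
c1 = perm_from_cycles([(0, 1), (2, 3), (4, 5), (6, 7)])
c2 = perm_from_cycles([(0, 2), (1, 3), (4, 6), (5, 7)])
c3 = perm_from_cycles([(0, 4), (1, 5), (2, 6), (3, 7)])
stages = [c1, c2, c3]
```

Two networks whose stage lists both pass pairwise_commute would, by the argument above, fall in the commutative class; the sketch does not construct the equivalence map itself.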


VII. Conclusions 


The paper has introduced a network model and 
dealt with the classification of cube-connected 
networks with ISC using this model. It has been 
shown that there exist at least three classes of 
cube-connected networks with non-isomorphic group 


properties. The characterization of each class 
has also been provided. These results have three 
implications. First, there are cube-connected 


networks which generate all of their interconnec- 
tions in one pass. Moreover, some generate commu- 
tative while some others generate non-commutative 
groups. Therefore, a network in one class can not 
simulate a network in the other class. Finally, 
there are cube-connected networks which fail to 
realize any group in one pass. Hence, they have 
the potential to generate groups of larger order 
if they are operated under a multi-pass scheme. 

As further research, these results can be used 
to obtain network synthesis techniques for the 
three classes of networks presented in this paper. 
Abstract The FEM-2 parallel computer is 
being designed using methods differing from those 
ordinarily employed in parallel computer design. 
The major distinguishing aspects are: (1) a top- 
down rather than bottom-up design process, (2) the 
design considers the entire system structure in 
terms of layers of virtual machines, and (3) each 
layer of virtual machine is defined formally 
during the design process. The result is a com- 
plete hardware/software system design. The basic 
design method is discussed and the advantages of 
the method are considered. A status report on the 
FEM-2 design is included. 


Introduction 


The Finite Element Machine [1,2] is an array 
of microprocessors, originally designed as a 
special purpose parallel computer for solution of 
problems in structural analysis using finite 
element methods. The authors are currently in the 
process of designing a successor, FEM-2, aimed at 
essentially the same applications. 


Parallel Machine Design 


In most parallel machine design, the basic 
hardware decisions are fixed at an early stage of 
the design, long before the software organization 
and external environment have been considered in 
detail. This approach often leads to major prob- 
lems at later stages, where the software and 
external supporting environment must be distorted 
to match the already fixed hardware organiza- 
tion. The general approach of early decision on 
hardware, followed by later detailed software 
design is seen in the original FEM [1,2], and in 


most other designs reported in the literature, 
e.g., Blue CHiP [3], TRAC [4], MPP [5] to name a 
few. This design approach is basically a 
"bottom-up" approach. 


This work supported in part by NASA Contracts 
NAS1-17130 and NAS1-17070 while the authors were 
in residence at ICASE. The first author was also 
supported in part by NSF Grant MCS/8-00763. 
In the FEM-2 design, an alternative "top-down" 
approach has been adopted. While the use of a 
top-down approach to system design is not novel, 
the particular form this has taken in the FEM-2 
design is novel in the context of parallel 
computer design. Two aspects are of note: 


a. FEM-2 is considered to be composed of layers 
of virtual machine. Each layer defines the view of 
the system available to one class of users. Four 
layers of virtual machine are currently conceived: 
(1) the applications user’s machine (e.g., as 
defined by the interactive command language), 
(2) the applications programmer/numerical 
analyst’s machine (e.g., as defined by the 
applications language), (3) the systems 
programmer’s machine (e.g., as defined by the 
operating system structure), and (4) the hardware 
itself (which if microprogrammed may include 
another layer of virtual machine). 


b. Each layer of virtual machine is formally 
specified during the design process, using the 
methods of H-graph semantics [6] to construct a 
formal model of each layer. The advantages of 
this formal specification are explained below. 


A virtual machine is composed of (1) various 
types of data objects, (2) various operations on 
those data objects, (3) various sequence control 
mechanisms for specifying the order of the 
operations, (4) various data control mechanisms 
for controlling access to data objects by the 
operations, and (5) storage management mechanisms 
for determining the placement and movement of data 
and code during program execution. 


The FEM-2 Virtual Machines 


Although complete virtual machine 
descriptions cannot be given here, a brief sketch 
will indicate the general type of results from 
this design approach. Considering each of the 
four levels of virtual machine, some typical data 
objects, operations, control mechanisms, and 
storage management methods are listed below. 


Application User’s Virtual Machine 


The FEM-2 user would typically be a struc- 
tural engineer using the system as an interactive 
workstation that allows him to store the descrip- 
tion of a structural model, to invoke applications 
packages to analyze the model, and to display the 
results. The following is a partial list of the 
virtual machine components at this level. 


Data objects: 


Structure/substructure model 
Grid description 
Node/element description 
Load set 

Displacements of nodes 
Stresses on elements 


Operations: 
Define structure model 
Generate grid 
Define elements 
Solve structure model/load set for displacements 
Calculate stresses 
Data base operations (store model in DB/retrieve) 


Sequence control: 


Direct interpretation of user commands 


Data control: 
Workspace (user local data) 
Data base (long-term storage; shared data) 


Storage management: 


Dynamic storage allocation for models, results, 
workspaces, etc. 
Data movement between data base and workspace 


Numerical Analyst’s Virtual Machine 


The numerical analyst is a research user who 
views the machine in terms of a high-level lan- 
guage that allows him to specify directly the data 
structures, operations and their sequences, and 
the parallelism in the linear algebra necessary to 
implement efficiently a structural engineer’s 
application. We assume as a base a sequential 
language such as Fortran, Pascal, or Ada, and only 
mention some of the new constructs needed for 
effective control of the parallel processing and 
data distribution in the parallel system. 


Data objects: 


Windows on arrays (e.g., row, column, block 
descriptors, for remote access to non-local 
data) 


Operations: 
Tasks (programmer-defined parallel procedures) 
Window operations: create window, access/assign 
data visible in a window 
Broadcast data to a set of tasks 
Linear algebra operations: inner product, vector 
operations, etc. 


Sequence control: 


Forall loops -- do all iterations in parallel if 
possible 


Pardo...end -- do all statements in parallel 

Task control: initiate a task, pause, resume a 
paused task, terminate 

Remote procedure call - location determined by 
location of data visible in a window 
Data control: 
All data owned by a single task 
Data accessible non-locally only via windows 
Windows may be transmitted as parameters, further 
partitioned, stored as values of variables, etc. 
Tasks may communicate through windows 


Storage management: 


Dynamic creation of data objects by a task 

Data lifetime = lifetime of owner task 

Dynamic creation of multiple task replications 
Local data of a task retained over pause/resume 
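
As an illustration of the window idea listed above, the following Python sketch models a window as a descriptor onto a block of an array owned by another task. It is a minimal sketch under assumptions: the class and function names (Window, row_window, column_window) are hypothetical, a plain list of rows stands in for the owner task's array, and no remote communication is modelled.

```python
from dataclasses import dataclass

@dataclass
class Window:
    """Descriptor for a rectangular block of a 2-D array owned elsewhere."""
    owner: list   # the owning task's array (list of rows); assumed stand-in
    row0: int
    col0: int
    nrows: int
    ncols: int

    def _check(self, i, j):
        if not (0 <= i < self.nrows and 0 <= j < self.ncols):
            raise IndexError("access outside window")

    def get(self, i, j):
        """Access data visible in the window."""
        self._check(i, j)
        return self.owner[self.row0 + i][self.col0 + j]

    def set(self, i, j, value):
        """Assign data visible in the window."""
        self._check(i, j)
        self.owner[self.row0 + i][self.col0 + j] = value

    def sub(self, row0, col0, nrows, ncols):
        """Partition further: a window onto part of this window."""
        if (row0 < 0 or col0 < 0
                or row0 + nrows > self.nrows or col0 + ncols > self.ncols):
            raise IndexError("sub-window exceeds parent window")
        return Window(self.owner, self.row0 + row0, self.col0 + col0,
                      nrows, ncols)

def row_window(a, r):
    """Window onto row r of array a."""
    return Window(a, r, 0, 1, len(a[0]))

def column_window(a, c):
    """Window onto column c of array a."""
    return Window(a, 0, c, len(a), 1)
```

Because a window is an ordinary value, it can be passed as a parameter or stored in a variable, which matches the data control rules listed above.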


System Programmer’s Virtual Machine 


By specifying the run-time representation of 
tasks, their scheduling, the communication between 
them, and the storage representation of the data, 
the system programmer’s virtual machine is used to 
implement the numerical analyst’s virtual 
machine. The following is a partial list of the 
virtual machine components. 


Data objects: 

Code blocks/constants blocks 
Task/procedure activation records (local data) 
Window descriptors 
Storage representations for scalars, arrays, etc. 
Messages from tasks: 

initiate K replications of a task of type T 

pause and notify parent task 

resume a child task 

terminate and notify parent 

remote procedure call 

remote procedure return 

load code/constants 


Operations: 
Usual sequential operations: arithmetic, procedure 
call, etc. 
Library routines for linear algebra operations 
Format and send message (one of the 7 types above) 
Decode and execute message (e.g., an initiate task 
message may require the following steps: find 
code for task, allocate an activation record, 
copy parameters from the message queue into 
activation record, enter task in ready queue) 


Sequence control: 


Usual sequential language control structures 


Data control: 
Usual sequential language structures 


Storage management: 
General heap with variable size blocks 
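
The "decode and execute message" steps listed above for an initiate-task message can be sketched as follows. This is a hedged illustration only: the names Kernel, ActivationRecord and the message fields (kind, task_type, k, params) are hypothetical, and tagging each replication with a replica index is an assumption not stated in the design.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class ActivationRecord:
    """Local data of one task activation (hypothetical layout)."""
    task_type: str
    code: object
    params: dict
    local_data: dict = field(default_factory=dict)

class Kernel:
    """Minimal message decoder for one cluster (illustrative only)."""

    def __init__(self, code_table):
        self.code_table = code_table   # task type -> code block
        self.ready_queue = deque()

    def handle(self, message):
        """Decode a message and execute the matching action."""
        if message["kind"] == "initiate":
            return self._initiate(message)
        raise NotImplementedError(message["kind"])

    def _initiate(self, message):
        # Step 1: find code for the task.
        code = self.code_table[message["task_type"]]
        records = []
        for replica in range(message["k"]):
            # Steps 2 and 3: allocate an activation record and copy
            # the parameters from the message into it.
            params = dict(message["params"], replica=replica)
            record = ActivationRecord(message["task_type"], code, params)
            # Step 4: enter the task in the ready queue.
            self.ready_queue.append(record)
            records.append(record)
        return records
```

The other six message types would be further branches in handle; they are left out to keep the sketch short.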


Hardware architecture 


The requirements imposed by the upper levels 
of virtual machine suggest that the architecture 
should be chosen to effectively support: 

Large scale dynamic task initiation 

Remote access to local data (through windows) 

Large messages (between tasks, and from a task 
to the operating system) 

Irregular communication patterns 

Large storage requirements; dynamic allocation 

Fast linear algebra operations (to extract the 
low-level parallelism available in these 
operations) 


In addition, several further requirements 
are imposed independently: 

Use off-the-shelf hardware/software if possible 

Provide a way to extend the system to larger 
configurations easily 

Provide reconfigurability to isolate faulty 
hardware components 

Provide multi-user access 


From these requirements, an architecture is 
evolving that is configured as clusters of 
processing elements organized around a shared 
memory. Sets of clusters communicate through a 
common communication network. Within each 
cluster, one PE runs the operating system kernel, 
which fields incoming messages and assigns 
available PE’s to process them. Messages arriving 
in the input queue of any cluster can be processed 
by any available PE. Since this architecture will 
be described at length in other papers, no 
detailed design is given here. 


Formal Specification of Virtual Machines 


By formally specifying the data objects, 
operations on those data objects, control 
mechanisms, and storage management techniques of 
each virtual machine level, a detailed 
software/hardware design can be obtained that 
specifies the function of each level as well as 
its implementation on the next lower level. Our 
research uses the methods of H-graph semantics [6] 
for making this formal specification. H-graph 
semantics is a mathematical modeling method for 
software/hardware systems that can be used to 
construct a precise mathematical model of each 


virtual machine level. The data objects are 
modeled as hierarchies of directed graphs (H- 
graphs) in which the nodes represent abstract 


storage locations and the arcs represent access 
paths. Data types are modeled using formal "H- 
graph grammars," a type of BNF grammar in which 
the "language" defined is a set of H-graphs 
representing a class of data objects. Operations 
(procedures) on the data objects are modeled as 
"H-graph transforms," which are functions defining 
transformations on the H-graph models of data 
objects. H-graph transforms may invoke each other 
in the usual manner of subprogram calling 
hierarchies to determine the overall flow of 
control in a model of a virtual machine. 
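
To make the modelling vocabulary above concrete, here is a very small Python sketch of the idea: nodes as abstract storage locations, labelled arcs as access paths, a node whose value is itself a graph (the hierarchy), and a function that transforms the model. It is an informal illustration, not the formal definition from [6]; the names Node, Graph and assign are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Abstract storage location; its value is an atom or a nested Graph."""
    value: object = None

@dataclass
class Graph:
    """Directed graph whose labelled arcs are access paths between nodes."""
    entry: Node
    arcs: dict = field(default_factory=dict)   # (id(node), label) -> Node

    def link(self, src, label, dst):
        """Add an arc labelled `label` from src to dst."""
        self.arcs[(id(src), label)] = dst

    def follow(self, *labels):
        """Follow an access path from the entry node."""
        node = self.entry
        for label in labels:
            node = self.arcs[(id(node), label)]
        return node

def assign(graph, path, atom):
    """A transform in miniature: store an atom at the end of an access path."""
    graph.follow(*path).value = atom

# A two-element list modelled as a graph, held as the value of an outer node.
head, tail = Node(1), Node(2)
lst = Graph(head)
lst.link(head, "next", tail)
outer = Node(lst)
```

A grammar over such graphs (the data types) and the calling hierarchy of transforms are beyond this sketch.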


In the FEM-2 design process, each layer of 
virtual machine is designed first, starting with 
the top layer and considering each layer as 
defining the requirements that must be satisfied 
by the design at the level below. Several 
iterations through the four levels are made, 


adjusting the design to find an appropriate mix of 
hardware and software at each level. As the 
design begins to "firm up", the individual virtual 
machines are defined formally. The precise formal 


definitions are then used as the basis for 
simulations of the various virtual machine 
levels. Simulations to measure the storage, 
processing, and communication patterns in typical 


FEM-2 applications and to determine the ease of 
programming the machine at the various levels are 
of particular importance. The ultimate result is 
to be a detailed design of the hardware and 
software, 


completely specified at each level in 
terms of its function and its implementation on 
the next lower level of virtual machine. 


Conclusion 


A major advantage of the top-down, layers of 
virtual machine, design approach is that it forces 
a design of the entire system structure, including 
I/O (virtual) devices, global control strategies, 


interfaces with the outside environment, etc. at 
an early design stage. It also allows the 
potential parallelism at various levels to be 


considered in detail: parallelism in user requests 
for simultaneous solution of several independent 
problems, parallelism in the substructure analysis 
of a larger structure, parallelism in the finer 
structure of solution of a particular system of 
simultaneous equations, etc. A third advantage is 
that the entire design process may be iterated, 
adjusting the design of each virtual machine 
level, until the proper match of hardware and 
software organizations is found. 


Current Status 


The FEM-2 design effort has been underway 


since December 1982. At present the first 
iteration of the design of the four layers of 
virtual machine is nearing completion. Several 
scenarios of use of the numerical analyst’s 


virtual machine have been carried out in detail, 
using a detailed design of a typical algorithm to 
get quantitative estimates of processing 
requirements, storage requirements, and 
communication requirements for a typical large- 
scale application. One such analysis is reported 
in [7]. H-graph semantics definitions of the 
various levels are being constructed. 
Abstract 

Recent advances of VLSI technologies have 
made multi-microprocessor systems feasible to 
construct. This paper presents a multi- 


microprocessor system for a LISP-based concurrent 


programming language, Concurrent LISP. Concurrent 
LISP is designed for user oriented concurrent 
programs, especially for artificial intelligence 
programs. The authors had developed Concurrent 


LISP on single processor systems. The multi-micro- 
processor system proposed here is constructed on 
the basis of these experiences. The multi- 
microprocessor system is constructed using general 
purpose microprocessors and it has the language 
oriented system configuration. 

The multi-microprocessor system presented has 
the nine processor elements and the large common 
memory area. Reflecting the types of the data to 
be stored and their access mechanisms, each 
processor element has the specialized memory 
interface circuits, and the common area is 
separated into three sub-areas. The system 
software is distributed to all the processor 
elements and has the hierarchical configuration. 
The system software, especially the operating 
system, is simplified well to reduce the system 
overhead. 


1 Introduction 


This paper describes a multi-microprocessor 
system which has specialized memory interface 
circuits for list processing and multiprocessing. 
We have developed a LISP-based concurrent 
programming language, called Concurrent LISP (C- 
LISP) [5]. C-LISP has been developed to make use 
of multi-process description mechanism for the 
artificial intelligence problems instead of 
conventional description mechanisms such as 
backtracking and coroutines. C-LISP is user 
oriented, and it has simple yet flexible 
facilities to describe concurrent processes. We 
had developed the C-LISP interpreter on a large 
scale computer (FACOM M-200) [5] and on an MC68000 
system [3]. Based on the experiences on these 
interpreters, we are at work on the development of 
a C-LISP machine composed of multiple 16-bit 
microprocessors (nine MC68000's) and a large 
common memory area (8 MB) [4]. 

C-LISP has flexible facilities to describe 
explicit parallelism. A typical example program is 
a multi-process search program for a game problem: 
multiple cooperative processes search their own 
paths in the search space for the game. During the 
execution of C-LISP programs, many concurrently 
executable processes are normally created. C-LISP 
programs need large computation power, since most 
LISP programs demand computation capacity rather 
than I/O capacity. A multi-processor system 
configuration suits C-LISP, since in such a 
configuration every processor element will be 
well utilized by C-LISP processes. 

The multi-microprocessor system presented is 
an MIMD type system consisting of two different 
types of processor elements and a very large 
common memory area. C-LISP programs are stored on 
the common memory area and executed by the 
processor elements in parallel. The two types of 
processor elements are Master Processor (MP) for 
the management of the whole system and Interpreter 
Processors (IP's) for interpretation of C-LISP 
programs. The system monitor is distributed to MP 
and IP's. Each IP has the interpreter which is 
controlled by the system monitor. All IP's have 
the same program and hold no process data except 
for a certain portion of the processes. 

We paid attention to two key problems in 
designing our system. One problem is to dedicate 
the processor elements to interpretation of 
C-LISP programs. Since many concurrent processes 
are normally created, all the processor elements 
are fully utilized by the processes. The other 
problem is the well balanced design of the system 
software and hardware. On a general purpose 
system, the software bridges the gap between the 
speciality of the applications and the generality 
of the system. Therefore, the overhead of the 
software is usually heavy. On the other hand, we 
may get a powerful C-LISP system if we can 
construct the machine using special hardware 
architecture or firmware. However, it is 
expensive and time consuming to construct such 
specialized systems. We designed this system using 
general purpose microprocessors with small scale 
additional hardware. 


2 Overview of the System 


2.1 Concurrent LISP 


Concurrent LISP is a concurrent programming 
language based on LISP 1.5. C-LISP completely 
includes LISP 1.5 as its sequential part. C-LISP 
has simple yet powerful concurrent functions to 
create a process explicitly and to write 
inter-process communication. The definition of a 
process of C-LISP is: 


"A process is a self-contained entity which 
evaluates a given form." 

Processes are activated at top-level and at 
STARTEVAL functions. The process activated at top- 
level is called the main process, and other 
processes activated at STARTEVAL functions are 
called sub-processes. A process which creates a 
new process is called the parent process of the 
created process. Conversely, the created 
process is called the son process. Processes have 
properties such as identifiers, relationships 
among processes, status, mailboxes, evaluation 
results and so on. 

The concurrent functions include three 
primitive concurrent functions, which are 
STARTEVAL for process activation and CR and CCR 
for interprocess communication, and also include 
the basic functions for manipulation of process 
properties. All of these functions are designed to 
fit the language features of LISP. 

This definition allows users to use processes 
as program components in their programs like 
variables and procedures. Since C-LISP is designed 
for writing problem solving programs which require 
flexibility of both control and data structure, 
C-LISP is useful not only for writing application 
programs in itself but also for constructing 
application oriented languages on it. 

The primitive concurrent functions STARTEVAL, 
CR and CCR are presented below. The language 
features are described in detail in the references 
[4][5]. 


* starteval[proc_1;proc_2;...;proc_n] 

proc_i = list[name_i;form_i;shared_i], 
name_i = name of i'th son process, 
form_i = form to be evaluated by i'th son 
process, 
shared_i = shared variables available for i'th 
son process. 

When a process executes STARTEVAL, the process 
may activate n son processes. Each son process 
has its own name specified by name and evaluates 
form. The value of STARTEVAL is a list of names 
of son processes; 

starteval[proc_1;proc_2;...;proc_n] = 
list[name_1;name_2;...;name_n]. 


* cr[var;form] 

A process evaluates form with the exclusive right 
to access the variable var. During evaluation of 
form, the process keeps the right. The value of 
cr[var;form] is the value of form; 

cr[var;form] = form. 


* ccr[var;condition;form] 

A process waits until condition is neither NIL 
nor F, and evaluates form with the exclusive 
right to access the variable var shared among 
processes. The value of ccr[var;condition;form] 
is the value of form; 

ccr[var;condition;form] = form. 

C-LISP processes communicate with each other 
via shared objects. These two primitives guarantee 
mutual exclusion for the shared objects among the 
processes. (The shared objects are variables 
shared among processes, certain properties on 
property lists, and mailboxes of processes.) 
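
The behaviour of CR and CCR described above resembles a critical region and a conditional critical region. The Python sketch below imitates them with a condition variable from the standard threading module. It is an analogy under assumptions, not the C-LISP implementation: the class SharedVar and the functions cr and ccr are hypothetical, and forms and conditions are passed as Python callables.

```python
import threading

class SharedVar:
    """A shared object guarded by one condition variable (assumed model)."""
    def __init__(self, value=None):
        self.value = value
        self.cond = threading.Condition()

def cr(var, form):
    """Evaluate form(var) with exclusive access to var; return its value."""
    with var.cond:
        result = form(var)
        # The form may have changed something a waiting process tests.
        var.cond.notify_all()
        return result

def ccr(var, condition, form):
    """Wait until condition(var) is true, then evaluate form(var)
    with exclusive access to var; return its value."""
    with var.cond:
        var.cond.wait_for(lambda: condition(var))
        result = form(var)
        var.cond.notify_all()
        return result
```

A mailbox can then be modelled as a SharedVar holding a list: a sender appends inside cr, and a receiver removes an item inside ccr with the condition "mailbox not empty".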

The following example program computes 
Fibonacci numbers. This program is not a typical 
one but includes several concurrent functions in 
a few lines. This function also shows that 
processes can be activated recursively. 


(FIBONACCI (LAMBDA (N) 
  (COND ((LESSP N 2) 1) 
        (T 
          ((LAMBDA (X) 
             (PLUS (FIBONACCI (SUB1 N)) 
                   (CCR X (TERMP X) (PROCVAL X)))) 
           (CAR 
             (STARTEVAL 
               ((GENSYM) (FIBONACCI (SUB2 N)) NIL))))) 
  ))) 


The meaning of this function is as follows: 

If n<2 then fibonacci[n]=1. 

Otherwise, create a new process which executes 
fibonacci[n-2]. The new process is given a name 
generated by gensym[] and no shared variables in 
the initial environments. The creating process 
computes fibonacci[n-1] by itself. The creating 
process waits until the created process 
terminates. (termp[x] becomes true if process x 
has terminated.) The creating process gets the 
result of the created process by procval[x]. The 
creating process adds these values and returns 
the sum. 
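
The same control structure can be imitated in Python with one thread per son process. This is a sketch of the pattern only, assuming a thread is an acceptable stand-in for a C-LISP process; the class Process and its method procval are hypothetical names chosen to echo the functions above.

```python
import threading

class Process:
    """Stand-in for a son process: evaluates one form in its own thread."""

    def __init__(self, form):
        self._value = None
        self._thread = threading.Thread(target=self._run, args=(form,))
        self._thread.start()           # activation, as with STARTEVAL

    def _run(self, form):
        self._value = form()

    def procval(self):
        """Wait for termination (the role of TERMP inside CCR),
        then return the evaluation result."""
        self._thread.join()
        return self._value

def fibonacci(n):
    if n < 2:
        return 1
    son = Process(lambda: fibonacci(n - 2))   # son computes fibonacci[n-2]
    own = fibonacci(n - 1)                    # creator computes fibonacci[n-1]
    return own + son.procval()
```

As in the C-LISP version, the son is started before the creating process begins its own recursive call, so the two branches can proceed concurrently.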


2.2 Overview of the System Configuration 


2.2.1 C-LISP Interpreter 


Fig.1 shows the overview of the configuration 
of the interpreter on single processor systems. 
The interpreter has two program modules, 
the schedule module and the interpret module. The 
former manages all processes, i.e., management of 
process activation, process switching and process 
termination. The latter interprets given C-LISP 
programs under the control of the former. For 
quick process switching, the interpret module is 
designed to load no private data of processes in 
itself. We call this feature "transparency" of the 
interpret module. Each process has its own private 


data on the process control block (PCB), the 
control stack, and the environment realized using 
association list (A-list) method. For the 
realization of the multiple control stacks and 
environments, linked structure is utilized well in 
the interpreter. Though contiguous memory 
allocation is usually more efficient in both 
aspects of access speed and memory size than the 
linked structure, it is difficult to arrange the 
data of multiple processes into contiguous memory 
space. The performance of the interpreters already 
developed is not so high because of the memory 
management task for multiple processes. The 
interface circuits described in this paper are 
designed to solve this problem. 

Fig.1 Overview of the C-LISP Interpreter 
(components shown: Process Management, Interpret 
Module, Schedule Module, Control Stacks CS1, CS2, 
CS3, A-lists, Arrays, Strings, etc.) 

Based on the configuration of the 
interpreter, we determined the basic design of the 
multi-microprocessor system. The following are 
the basic concepts for the design. 

1. C-LISP processes should be loaded on one 
common space. 

2. Processors which interpret programs should be 
transparent for processes, i.e., C-LISP engine. 

Consequently, the following are the basic problems 
which must be solved. 

1. The bus bottleneck problem must be solved to 
connect considerable number of processors to 
the common bus. 

2. The access methods for specialized memory 
areas should be reflected on the hardware 
configuration to improve the access speed. 

3. Each processor should determine its tasks by 
itself to reduce overhead for the processor- 
processor interaction. 


2.2.2 The Hardware Configuration 


The multi-microprocessor system consists of 
nine 16-bit microprocessors, MC68000 (M68K), and a 
large common memory area (8MB). The processor 
called Master Processor (MP) manages the entire 
system, and the other eight processors called 
Interpreter Processors (IP's) interpret C-LISP 
programs under the control of MP (Fig.2). 
According to the access method and access 
frequency, the common memory is separated into 
three parts, each of which has an independent 
common bus, to avoid the bus bottleneck. 

Each processor element consists of the 
processor part and interface part. The former is 
designed as a general purpose single board 
computer with one M68K (8 MHz), RAM (256 KB), ROM 
(2/4 KB), one communication port, and IEEE 796 bus 
interface. Each processor has its own programs on 
the local memory, i.e., the system monitor 
functions, the garbage collectors, and interpreter 
functions. The latter includes intelligent 
interface circuits to the common memory areas, and 
interrupt interface circuits between processor 
elements. The interface circuits play a very 
important role in this system because they bridge 
the gap between the specialized information 
structure of C-LISP and the general purpose 
microprocessor. 

Fig.2 Overview of the System Configuration 

The three common memory areas are as follows. 

1. Control Stack area: Control stacks of all 
processes are stored. The control stacks contain 
control information of processes such as return 
address and temporary variables. The stack area 
is divided into 1 KB blocks. Each process has 
logically continuous control stack space which 
is composed of one or more physical blocks. 

2. List Cell area: List cells are stored. This 
area is designed to have 1 Mega Cells. Each cell 
has 48 bits, 20 bits each for CAR and CDR, and 
8 bits for attributes. 

3. PCB and Random Access (PCB & RA) area: Non 
list data, such as character strings and arrays, 
are stored. C-LISP system also uses this area as 
working space for system management. 
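
The list cell format given above (48 bits: 20 for CAR, 20 for CDR, 8 for attributes) can be illustrated with a short Python sketch. The field order inside the 48-bit word and the function names are assumptions made for illustration; the text gives only the field widths. Note that 20-bit pointers address exactly 2^20 cells, which agrees with the stated 1 Mega Cells.

```python
ADDR_BITS = 20                      # width of CAR and CDR fields
ATTR_BITS = 8                       # width of the attribute field
ADDR_MASK = (1 << ADDR_BITS) - 1
ATTR_MASK = (1 << ATTR_BITS) - 1
CELL_BYTES = (2 * ADDR_BITS + ATTR_BITS) // 8   # 48 bits = 6 bytes

def pack_cell(car, cdr, attrs):
    """Pack one list cell into 6 bytes (assumed order: attrs, CAR, CDR)."""
    if not (0 <= car <= ADDR_MASK and 0 <= cdr <= ADDR_MASK
            and 0 <= attrs <= ATTR_MASK):
        raise ValueError("field out of range")
    word = (attrs << (2 * ADDR_BITS)) | (car << ADDR_BITS) | cdr
    return word.to_bytes(CELL_BYTES, "big")

def unpack_cell(raw):
    """Return (car, cdr, attrs) from a 6-byte cell."""
    word = int.from_bytes(raw, "big")
    car = (word >> ADDR_BITS) & ADDR_MASK
    cdr = word & ADDR_MASK
    attrs = word >> (2 * ADDR_BITS)
    return car, cdr, attrs
```

Since a cell is three 16-bit words wide, a processor with a 16-bit data bus needs several transfers per cell unless an interface performs the access on its behalf.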

In the case of a multi-M68K system with one 
common bus, no more than two processors can be 
connected to the bus, since 62.5 % of one machine 
cycle is necessary for memory access. In our 
system, all M68K programs are stored on local 
memory, whose access time is shorter than that of 
the common memory, to decrease the access 
frequency to the common area to half or less. 

For inter-processor communication, this 
system has interrupt signal lines between MP 
and IP's, i.e., a star-connection configuration 
whose center is MP. The interrupt lines are used 
for synchronization of the processors, and the 
communication messages are put on the interrupt 
message buffers on the PCB & RA area. The usage of 
the interrupt lines is restricted to several 
purposes, which are described in a later 
section, because of the overhead for the 
synchronization. 

2.2.3 The Software Configuration 


The software which works on the multi-micro- 
processor system is composed of 
1) User Programs, 
2) Interpreter, 
3) Garbage Collector, and 
4) System Monitor. 
Fig.3 shows the layered configuration of the 
software and the relationships between the 
software components and hardware components. 

Fig.3 Layered Configuration of the Software 
(labels shown: C-LISP Processes, Interpreter, 
Garbage Collector, System Monitor, Common) 

1) User Programs 
User programs are put on the list cell area. 
During their execution, the interpreter creates 
and puts information necessary to execute user 
processes on the common memory, i.e., association 
lists and property lists on the list cell area, 
control stacks on the stack area, and arrays, 
strings, large numbers and process control blocks 
on the PCB & RA area. 
2) Interpreter 
The interpreter on every IP executes users' 
programs on the common memory. The interpreter 
should not possess data of user processes on the 
local memory for quick process switching. This 
feature is called "transparency" of the 
interpreter. 
3) Garbage Collector 

This system has a garbage collector for 
every common memory area. The garbage collectors 
are invoked by events indicating shortage or 
exhaustion of memory cells. The PCB & RA area 
garbage collector squeezes garbage out of the 
allocated area, and reclaims the free area. The 
stack area garbage collector is invoked by 
exhaustion of the stack blocks, and finds the 
garbage blocks among the allocated ones. The list 
cell area garbage collector is invoked by the FCP 
interrupt, which indicates that the list cells 
will soon be exhausted. 

List cell area garbage collection is designed 
to be performed by all processors in parallel. We 
use a modified mark-and-collect algorithm for our 
system. The garbage collection is performed in two 
phases: in the first phase, each IP marks active 
cells from the roots of the processes allocated to 
itself, and in the second phase, each IP collects 
unused cells in its allocated portion, which is an 
eighth part of the whole area. MP arranges the 
synchronization of these activities at each phase, 
and restores the collected cells to FCP. All IP's 
execute their tasks in parallel under the control 
of MP. In both phases, the load of the tasks is 
distributed to all processor elements, so that the 
response time of the garbage collection is 
improved. Since no bad effect is caused by 
overwriting marks on the same cells, IP's need not 
access cells exclusively for marking and may mark 


the same cells twice or more. As the exclusive 
access usually takes a long time, this is an 
advantageous feature. The on-the-fly garbage 
collection algorithm [2] is not introduced, 
because it obliges IP's to load no roots for 
marking at all. Such complete transparency is 

considered harmful for the system performance. 
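
The two-phase scheme described above can be sketched in Python with one thread per IP. This is a simplified illustration under assumptions: cells are (car, cdr) pairs of indices or None, the function name parallel_gc is hypothetical, joining the threads stands in for the synchronization arranged by MP, and the returned list stands in for the cells restored to FCP.

```python
import threading

def parallel_gc(cells, roots_per_ip):
    """Mark from each IP's roots, then sweep each IP's portion.

    cells        -- list of (car, cdr) pairs; each entry is an index or None
    roots_per_ip -- one list of root indices per IP
    Returns the indices of unused cells.
    """
    n_ips = len(roots_per_ip)
    marks = [False] * len(cells)

    def run_phase(target, args_list):
        threads = [threading.Thread(target=target, args=a) for a in args_list]
        for t in threads:
            t.start()
        for t in threads:
            t.join()                # barrier between phases (MP's role)

    def mark(roots):
        stack = list(roots)
        while stack:
            i = stack.pop()
            if i is None or marks[i]:
                continue
            # No exclusive access: marking the same cell twice is harmless.
            marks[i] = True
            stack.extend(cells[i])

    run_phase(mark, [(roots,) for roots in roots_per_ip])

    freed = [[] for _ in range(n_ips)]
    chunk = -(-len(cells) // n_ips)  # ceiling division: portion per IP

    def sweep(ip):
        lo = ip * chunk
        hi = min(lo + chunk, len(cells))
        freed[ip].extend(i for i in range(lo, hi) if not marks[i])

    run_phase(sweep, [(ip,) for ip in range(n_ips)])
    return [i for part in freed for i in part]
```

With eight entries in roots_per_ip each sweep covers an eighth of the cell area, as in the text. A cell already marked by another IP is skipped safely, because the IP that marked it also traces its successors before the barrier.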

4) System Monitor 

The functions of the system monitor (or the
operating system) are distributed to MP and IP's.
The MP portion mainly performs housekeeping tasks,
and the IP portion performs monitoring of user
processes. The MP portion manages state transition
of waiting and suspended processes, receives
requests from IP's and IOP (I/O Processor), and
executes the requested functions. On the other
hand, the IP portion selects a process and
executes it. Communication between these portions,
i.e., intra-OS communication, is performed via
either interrupt interfaces between MP and IP's or
message buffers on the PCB & RA area. The key
problem in the design of the system monitor is to
let IP's work as freely as possible. Therefore,
the direct intra-OS communication via the
interrupt interfaces should be restricted to real
time communication to stop or to suspend execution
of running processes. The strategies to manage
processes and processors are presented in another
section.


3 Intelligent Interface Circuits 


The multi-microprocessor system is composed
of general purpose microprocessors. Though the
memory area is separated to avoid a bus
bottleneck, we must provide specialized interface
circuits between the processors and the common
memory components, because objects stored in the
common memory have specialized data structures and
access mechanisms which may be different from
those of the conventional microprocessor.


3.1 List Cell Interface
3.1.1 Basic Idea 


The following are the basic requirements on
the C-LISP machine for efficient list cell access.

Address Translation: Since a list cell usually
has three portions in it, CAR, CDR and attributes,
a cell is rather longer than the bus width of
processors. A list cell should be accessed using
the cell address to decrease the overhead of
address translation by processors.

Quick List Read/Write: Quick list read/write
operation is indispensable for quick access to a
variable on an association list (A-list) and fast
list manipulation. For quick list manipulation,
the overlapped list read operation, i.e., list
pre-fetching, will work well, since the next cell
to be operated on may be found on the interface
registers.

Free Cell Pointer Circuit (FCP): The pointer to
the top of the free cell list must be accessed
exclusively. Since heavy overhead is inherent in
the arbitration of the pointer by software, we
should provide a special purpose circuit, which
always possesses the current top of the free cell
list and automatically updates it, to avoid the
overhead.

Quick list cell manipulation is important for
the C-LISP system to reduce the system overhead
for the memory allocation problem. In sequential
LISP systems, continuous data structures are used
for efficiency, e.g., CDR-coding [1], and shallow
binding and deep binding implemented using stacks.
However, managing continuous structures is
difficult in multiple process/processor systems.
For example, our system uses the A-list method to
make environments of processes. Though the A-list
method is said to be inefficient compared with
other sophisticated methods, it is considered more
efficient for putting multiple environments on the
common area. Moreover, if we can get quick list
access facilities, we can make other components of
the interpreter in the form of lists because of
the flexibility of list structure.


3.1.2 The List Cell Interface Circuit


The cell interface circuit has 16 IFR's
(interface registers), the sequence control logic,
and bus interface logic. Fig.4 shows the concepts
of the overlapped operation and the configuration
of the interface circuits. The width of an IFR is
24 bits, consisting of a 20-bit cell address
(i.e., up to 1 MCell) and 4 bits of attributes.
Since one cell consists of 48 bits, two IFR's are
occupied by one cell. The IFR's are composed of
high speed TTL memory chips. The cell interface
command is encoded as an absolute address in the
M68K's memory space. Fig.5 shows the instruction
schema. The cell command has two attributes and
four IFR fields. The first IFR field, i.e., the
to-CPU field, specifies the IFR to/from which M68K
transfers data. The second field, i.e., the
Cell-Address field, specifies the IFR which
contains the cell address to be accessed. The
third and fourth fields, i.e., the CAR and CDR
fields, specify the IFR's to/from which CAR and
CDR data are transferred from/to the list cell
area. The R/W field specifies the direction of
data transfer between IFR's and the list cell
area. (R/W=R means that data is read from the list
cell area, and R/W=W means the reverse direction.)
The M/N field specifies whether the attributes of
a half cell are masked at the data bus buffer of
the local bus. (M/N=M means "mask", and M/N=N
means "no-mask".) The following example is an M68K
move instruction used for data transfer on the
list cell interface circuit.

MOVE.L [N,R,1,2,3,4],destination

The meaning of this instruction is: the contents
of IFR1 are moved to the destination, and the CAR
and CDR data of the cell specified by IFR2 are
read and moved to IFR3 and IFR4 respectively.


bit   23-20     | 19  | 18  | 17-14  | 13-10      | 9-6 | 5-2 | 1,0
      Sel. cell | M/N | R/W | to-CPU | Cell-Addr. | CAR | CDR | 00

23-20 - Select Cell Area
19    - Mask or Non-mask
18    - Read or Write
17-14 - To-CPU Register Field
13-10 - Cell Address Register Field
9-6   - Car Register Field
5-2   - Cdr Register Field
1-0   - Always zero for long word operation

*  If zero, no data is transferred between IFR's
   and cells.
** If zero (To-CPU field), Test-and-Set operation
   is executed on the target cell.

Fig.5 Instruction Schema of the List Cell I/F


a. Concepts of the Overlapped Operation
(diagram: overlapped operation between the Local
Bus and the Common Bus through the IFR's,
24 bits x 16)

1. Interrupt Address Buffer
2. Interrupt Signaling Logic
3. Local Output Buffer
4. Local Input Buffer
5. Local Bus Control Logic
6. Local Address Buffer
7. Register Address Multiplexer
8. Zero Detector
9. Common Address Buffer
10. Common Output Buffer
11. Common Input Buffer
12. Sequence Control Logic
13. Common Bus Control Logic

Note:
* U/L = Upper word/Lower word
* Both buses satisfy IEEE 796 bus specification.
* A = Addr., D = Data, C = Command

b. Block Diagram of the List Cell Interface Circuit

Fig.4 The List Cell Interface Circuit
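
To make the bit layout of Fig.5 concrete, the following sketch packs and unpacks a cell command word. It is only an illustration: the paper does not state which bit value stands for M versus N or for R versus W, nor what the Select Cell Area value is, so the polarities (`mask=True` sets bit 19, `write=True` sets bit 18), the default `select=0`, the encoding of a don't-care field as 0, and the function names are all assumptions.

```python
def encode_cell_command(mask, write, to_cpu, cell_addr, car, cdr, select=0):
    """Pack the Fig.5 fields into a 24-bit command word.

    Bits 1-0 are left at zero (long word operation).
    Bit polarity for M/N and R/W is an assumption, not taken from the paper.
    """
    fields = {"select": select, "to_cpu": to_cpu,
              "cell_addr": cell_addr, "car": car, "cdr": cdr}
    for name, value in fields.items():
        if not 0 <= value <= 15:
            raise ValueError(f"{name} must fit in 4 bits, got {value}")
    return ((select << 20)
            | (int(bool(mask)) << 19)
            | (int(bool(write)) << 18)
            | (to_cpu << 14)
            | (cell_addr << 10)
            | (car << 6)
            | (cdr << 2))


def decode_cell_command(word):
    """Unpack a command word into its Fig.5 fields."""
    return {
        "select": (word >> 20) & 0xF,
        "mask": bool((word >> 19) & 1),
        "write": bool((word >> 18) & 1),
        "to_cpu": (word >> 14) & 0xF,
        "cell_addr": (word >> 10) & 0xF,
        "car": (word >> 6) & 0xF,
        "cdr": (word >> 2) & 0xF,
    }


# [N,R,1,2,3,4]: IFR1 to the CPU, cell addressed by IFR2,
# CAR into IFR3 and CDR into IFR4.
example = encode_cell_command(mask=False, write=False,
                              to_cpu=1, cell_addr=2, car=3, cdr=4)
```

Because the command is an address in the M68K's memory space, a word built this way would be used as the source or destination address of a MOVE.L rather than as data.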


The interface register 0 (IFR0) has a special
role. If the to-CPU field contains zero, a
Test-and-Set operation is executed on the cell
specified by the Cell-Address field. If other
fields contain zeroes, those fields are not used
in the operation. In Fig.6, several instructions
are presented. In those instructions, except the
Test-and-Set instruction, the operation on the
common bus starts just after the operation on the
local bus has finished. Thus, this interface
manages the overlapped operation of data transfer
on the local bus and the common bus.

This interface circuit has the update-interrupt
facility, which is provided to detect an update
event on a shared variable. The address of the
updated cell is passed to M68K. The system monitor
receives this event and executes the relevant
tasks.


3.1.3 The Free Cell Pointer Circuit


This system is designed to have only one free
cell list in the common cell memory. Exclusive
access to the top of the free cell list must be
maintained to deliver a new free cell to each
processor consistently. We provide the FCP circuit
for quick CONS operation. It is a simple circuit
which has a register holding the next free cell
address, a register holding the number of
remaining free cells, and the interrupt logic.
(Fig.7 shows the configuration.) The former
register is called the free cell pointer register
(FCPR) and the latter register is called the free
cell counter (FCC). Each time a processor element
reads FCPR to get a new cell, FCP automatically
updates FCPR just after the read operation. This
scheme is guaranteed by giving FCP the highest
priority on the list cell bus. FCC and the
interrupt logic are provided for initiating the
list cell garbage collection. FCP decrements FCC
whenever FCPR is read. When FCC indicates that the
remaining cells are fewer than a certain amount,
FCP interrupts MP to request garbage collection.
Exhaustive use of free cells is inhibited to
guarantee correct response for the read FCPR
operation. FCP activities are summarized as
follows:

1. arbitration for exclusive access to the top of
   the free cell list,
2. automatic update of the top of the free cell
   list, and
3. detection of shortage of the free cells.

These activities are simple but heavy if executed
by software. Since CONS operation and the list
cell garbage collection are primitive operations
of LISP and CONS is executed frequently, FCP is
indispensable for our system.


MOVE.L [N,R,1,2,3,4],D0
    IFR1 -> D0
    IFR2 -> Addr. Buf.
    Cell -> IFR3 - Car
    Cell -> IFR4 - Cdr

MOVE.L D0,[N,W,1,2,1,0]
    D0 -> IFR1
    IFR2 -> Addr. Buf.
    IFR1 -> Cell - Car

MOVE.L [N,-,0,1,-,-],D0
    IFR1 -> Addr. Buf.
    Cell -> D0 - Test&Set

Notes: IFRx means Interface Register x.
       A.B. = Addr. Buf.
       "-" means "don't care".

Fig.6 Flow of Data on the List Cell Interface
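
The three FCP activities of Section 3.1.3 can be modelled in a few lines of software. This is a sketch under stated assumptions, not a description of the circuit: the paper does not say how FCPR obtains the next address, so the model assumes the free cells are linked through a `next_free` mapping; the lock stands in for the bus-priority arbitration, and `on_shortage` stands in for the interrupt to MP. All names are hypothetical.

```python
import threading


class FreeCellPointer:
    """Software model of FCP: FCPR, FCC and the shortage interrupt."""

    def __init__(self, next_free, first_free, free_count, threshold, on_shortage):
        self.next_free = next_free      # assumed free-list link: cell -> next cell
        self.fcpr = first_free          # free cell pointer register
        self.fcc = free_count           # free cell counter
        self.threshold = threshold      # request GC when fcc drops below this
        self.on_shortage = on_shortage  # stands in for the interrupt to MP
        self._requested = False
        self._lock = threading.Lock()   # stands in for hardware arbitration

    def read_fcpr(self):
        """Return a new free cell and advance FCPR, as one atomic step."""
        with self._lock:
            if self.fcc <= 0:
                # The paper inhibits exhaustive use; the model just refuses.
                raise RuntimeError("free cell list exhausted")
            cell = self.fcpr
            self.fcpr = self.next_free[cell]   # automatic update after the read
            self.fcc -= 1                      # FCC is decremented on every read
            if self.fcc < self.threshold and not self._requested:
                self._requested = True
                self.on_shortage()             # request garbage collection
            return cell
```

A CONS in this model is a call to `read_fcpr()` followed by writes of CAR and CDR into the returned cell; the point of the hardware version is that the read and the pointer update need no software arbitration.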


3.2 Control Stack Interface 


3.2.1 Basic Idea 


Since the access frequency to the control
stack is quite high, for example about 1/5 of all
memory accesses in the C-LISP interpreter on the
single M68K system [3], an efficient access
mechanism is indispensable for this system.

The stack area is divided into 1 KB blocks,
which are allocated to processes block by block.
In the interpreter on the single M68K, since this
memory management is performed by software,
testing for illegal access outside the allocated
blocks incurs quite heavy overhead. Therefore, we
need memory management facilities on the processor
elements to provide a logically continuous space
for each process. In addition to the memory
management problem, local buffer memory should be
provided on the processor elements to accelerate
access to the control stacks.


3.2.2 The Control Stack Interface Circuit


The stack interface circuit consists of three
major portions: the limit registers, the buffer
memory (4 KB), and the DMA control logic. The
limit registers specify the upper and lower
boundaries of the portion of the control stack
loaded on the buffer. The DMA controller transfers
blocks between the buffer and the stack area.
Fig.8 shows the concepts of the operation and the
configuration of the stack interface circuit.

Fig.7 Free Cell Pointer Circuit

Each process is given its own logically
continuous stack space. The interpreter accesses
the control stack of a process using logical
addresses. The accessed location is always found
on the stack buffer. This is guaranteed by the
guard areas located at both ends of the loaded
portion on the buffer. (When either of the guard
areas is accessed, the system monitor makes the
next block available on the buffer.) When the
system monitor transfers a block from the buffer
to the common memory, it allocates a physical
block to a logical block. Consequently, the
interpreter is freed from the heavy overhead of
stack manipulation. We use this pseudo-page-fault
mechanism, since M68K has no page-fault
facilities. Our method has the restriction that
the processor cannot access a location distant
from the location accessed currently. (Since we
assume the width of the guard area is 256 bytes,
the processor cannot access a location whose
offset from the location accessed currently is
more than 255 bytes.) However, this restriction
has little effect on the software, since the
control stack space has very strict locality.
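
The reason for the 255-byte restriction can be checked with a small model of the limit registers and guard areas. The sketch below is illustrative only: it assumes the lower limit is inclusive and the upper limit exclusive, which the paper does not specify, and the function name and return values are hypothetical.

```python
GUARD_BYTES = 256      # guard area width assumed in the text
BUFFER_BYTES = 4096    # buffer memory size given in the text


def classify_access(addr, lower, upper, guard=GUARD_BYTES):
    """Classify a logical stack address against the loaded portion.

    lower is inclusive and upper is exclusive (an assumption of this sketch).
    """
    if not lower <= addr < upper:
        return "outside"        # not on the buffer at all
    if addr < lower + guard:
        return "lower_guard"    # monitor would load the block below
    if addr >= upper - guard:
        return "upper_guard"    # monitor would load the block above
    return "hit"                # served directly from the buffer


def step_stays_on_buffer(addr, offset, lower, upper):
    """True if moving by offset from a 'hit' address cannot miss the buffer."""
    return classify_access(addr + offset, lower, upper) != "outside"
```

With a 256-byte guard on each side, any address classified as "hit" is at least 256 bytes from either limit, so a step of up to 255 bytes lands on the buffer (possibly in a guard area, which triggers the block exchange) and never outside it.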


As described above, the stack interface
circuit has two major functions, i.e., memory
management for the logical stack spaces of
processes and buffering of the active portion of
the stacks for the common memory. The former
feature is indispensable to realize the multiple
process environments, i.e., virtualization of the
users' memory spaces. The latter has the buffering
effect, but it disrupts the transparency of IP's.
Because of this disruption, process switching
overhead becomes rather heavy, since blocks must
be swapped out and in. However, if processor
elements had no buffer on the interface, the low
access speed to the stack area would severely
affect the system performance, since the access
frequency of the stack area is quite high. In
addition, the process switching overhead is
negligible, because it takes about 2 ms to switch
processes while the switching interval is designed
to be long. (Notes: The DMA controller takes
0.5 ms to transfer one block, and four blocks are
transferred on average for each process switch. A
LISP interpreter usually consumes long CPU time,
so we assume one or more seconds for the interval
timer which triggers process switching.)
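
The claim that the switching overhead is negligible follows from the figures in the note. A minimal check, using only the numbers stated above and taking the shortest interval mentioned (one second) as the worst case; the function names are hypothetical:

```python
BLOCK_TRANSFER_MS = 0.5      # DMA time per 1 KB block (from the text)
BLOCKS_PER_SWITCH = 4        # average blocks moved per switch (from the text)
SWITCH_INTERVAL_MS = 1000.0  # "one or more seconds"; one second is the worst case


def switch_cost_ms(blocks=BLOCKS_PER_SWITCH, per_block_ms=BLOCK_TRANSFER_MS):
    """DMA time spent on one process switch."""
    return blocks * per_block_ms


def overhead_fraction(interval_ms=SWITCH_INTERVAL_MS):
    """Share of each switching interval spent moving stack blocks."""
    return switch_cost_ms() / interval_ms
```

Four transfers of 0.5 ms give the 2 ms quoted in the text, which is about 0.2 percent of a one-second interval and less for longer intervals.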


3.3 Example Procedures 


Fig.9 presents an APPEND and a SASSOC
procedure, which are typical procedures for list
manipulation. These procedures make good use of
the interface circuits.


4 System Monitor


This system has a distributed system monitor.
The tasks of the monitor are process management,
processor management, and memory management. I/O
management, which is one of the major tasks of
operating systems, is executed by the I/O
processor (IOP), so that it is excluded from this
paper.


1. Local Bus Control Logic
2. Lower Limit Register
3. Upper Limit Register
4. Address Comparator
5. Interrupt Vector Register
6. Interrupt Control Logic
7. Buffer Block Addr. Reg.
8. Common Block Addr. Reg.
9. DMA Counter/Addr. Reg.
10. DMA Controller
11. Data Buffer
12. Address Buffer
13. Common Bus Control Logic

b. Block Diagram of the Stack Interface Circuit

Fig.8 The Stack Interface Circuit


4.1 State Transition of C-LISP Processes 


Fig.10 shows the state transition diagram of
C-LISP processes. All processes are monitored and
moved from state to state by the system monitor.
To move a process from the waiting state to the
ready state, the system monitor must test the
waiting condition written in C-LISP: the system
monitor evaluates the second parameter of CCR by
itself. Therefore, the state transition diagram
from the system monitor's view is different from
the users' view, as shown in Fig.10.


   MOVE.L a,[M,-,1,0,-,-]       ; read list a
   MOVE.L [-,R,n,1,2,3],dummy   ; load first elem.
A1:MOVE.L [M,R,2,3,2,3],A7@-    ; move elem. to CPU
   CMP.L #NIL,[M,-,3,0,-,-]     ; end of list ?
   BNE A1
   MOVE.L b,[M,-,3,0,-,-]       ; set b on IFR3
A2:MOVE.L A7@+,[N,-,2,0,-,-]    ; move elem. of a
   MOVE.L FCP,[M,-,1,0,-,-]     ; get a cell
   MOVE.L [M,W,1,1,2,3],[N,-,3,0,-,-]
             ; CONS & set current list top on IFR3
   CMPA.L #BOTTOM,A7            ; termination test
   BEQ A2
   finished
a. Append - append list b to a.

   MOVE.L a,[M,-,1,0,-,-]       ; set A-list
   MOVE.L [-,R,n,1,3,4],dummy   ; get first pair
S1:CMP.L #NIL,[M,R,4,3,1,2]     ; termination test
   BEQ S2
   CMP.L [M,R,1,4,3,4],x        ; variable match ?
   BNE S1                       ; not match
   MOVE.L [N,-,2,0,-,-],value   ; get value
   finished
S2:CMP.L [M,R,1,0,-,-],x        ; last var. match ?
   BNE error                    ; not match -> error
   MOVE.L [N,-,2,0,-,-],value   ; get value
   finished
b. Sassoc - find a dotted pair whose CAR is x in a.
("-" means don't care.)


Fig.9 Example Procedures


1. Selected for execution
2. Timeout
3. Suspended for I/O, etc.
4. CCR wait
5. Condition satisfied
6. I/O completed, etc.

(Users' View)

7. Shared var. updated, etc.
8. Selected for testing
9. Condition unsatisfied


Fig.10 State Transition Diagram of User Processes


4.2 Allocation of Processes to Processors


C-LISP processes are created by MP and reside
in the common memory area. For the process
management task, the system monitor has process
queues and process pools for every state. Fig.11
shows the relationships of those queues and the
system monitor. It shows that MP executes the
housekeeping tasks, and IP's execute the
interpretation tasks. MP creates new processes,
watches events, moves a process from pending/wait
to ready/condition-test-ready, and monitors
process termination, and IP's select processes and
execute them.


4.3 Inter-process Communication


C-LISP processes communicate with each other
via shared objects. All the shared objects are put
on the common memory, so that they are always
visible from MP and IP's. Communication messages
are, therefore, buffered on the objects, and no
direct communication between processor elements is
necessary. Mutual exclusion for the shared objects
is controlled by flags located at the shared
objects: whenever IP's interpret CR or CCR, they
test-and-set those flags.

The system monitor synchronizes the processes
for the inter-process communication. The
synchronization operation is triggered by event
messages put on the request message queue. IP's
put the event messages on the queue when they
recognize local events such as update events of
shared objects and release events of shared
objects.


4.4 Inter-processor Communication 


Since the system monitor is distributed to MP
and IP's, the components communicate with each
other for cooperation. As this communication is
performed inside the system monitor, it is called
intra-OS communication. This system has two types
of intra-OS communication: the indirect
communication via the request message queue and
the direct communication via the interrupt
interface circuit. (Fig.12 and Table 1 summarize
the intra-OS communication.) The former is mainly
used to transfer request messages of processes and
event messages to MP. This type of communication
is one-directional, since MP directly replies to
the relevant processes. The latter is used for
very restricted purposes, while the former is used
for many purposes. To let IP's be dedicated to
interpretation, the direct communication is used
only for the messages which must be transferred as
soon as possible.


Fig.11 Process Management (state labels in the
diagram: Created, Pending, In Execution,
Terminated)


4.5 Memory Management


The memory management is very simple because
no protection mechanism is needed. This system has
simple memory allocation and reclamation
facilities. Each common area has a memory
allocation status descriptor which holds the
master information necessary for memory
management, e.g., FCP. C-LISP processes get memory
cells under the control of the managers of these
descriptors. On the other hand, the system has the
garbage collection facilities to reclaim garbage
in the allocated area, as described earlier.


Table 1 Intra-OS Communication

Direct Communication (Interrupt Driven Com.)
  From MP to IP
    Synchronization request for Garbage Collection
    Kill request of running processes
  From IP to MP
    Synchronization acknowledgement of G.C.
    Process termination report
    G.C. request (intentional G.C.)

Indirect Communication (Buffered Communication)
  From MP to IP
    none
  From IP to MP
    I/O request
    Pending report of CR and CCR
    Event messages (Release shared objects, Update
    shared objects, etc.)


Fig.12 Intra-OS Communication (diagram labels:
G.C. Sync. Req. etc., G.C. Sync. Ack. etc.,
Request Message Queue, Reply for Request,
I/O Request)


5 Discussion and Conclusion


In this paper, we presented a special purpose
machine which comprises general purpose
microprocessors. C-LISP was designed to apply
multi-process description techniques to artificial
intelligence problems instead of the conventional
techniques such as backtracking and coroutines.
From the experience with the interpreters
developed earlier, the authors found that C-LISP
programs usually have enough parallelism for
implementing them on multiprocessor systems, in
addition to the fact that LISP programs usually
require very large computation power.


The hardware configuration of this system
strongly reflects the language features of C-LISP.
However, our system involves several key problems
common among multi-microprocessor systems, such as
the bus bottleneck problem and the multiprocess
environment problem. We chose general purpose
microprocessors, since it was considered expensive
and time-consuming to make special purpose
processors by hardware or by firmware. Thus, the
system is constructed using the general purpose
microprocessor with small yet powerful circuits
for special purposes.

Special purpose multi-processor systems
consisting of general purpose processors will
become more popular. We consider that specialized
small scale circuits may bridge the gap between
the generality of the processors and the
speciality of users' applications on such
multi-processor systems.


A Multi-Micro System for I/O Intensive Applications

F. M. Tse
American Bell
Denver, Colorado 80234

Abstract -- During the acceptance test of a
computerized product, it is desirable to be able
to test the load handling capability of the
system. This paper describes a load generation
system using a multiprocessor architecture to
generate the large amount of data needed to
load-test the DIMENSION® AIS™/System 85.

The system described in this paper consists
of a total of 25 processors and employs a unique
bussing scheme. The system uses a master-slave
organization and includes several levels of
hierarchically ordered busses. The host processor
is used for high-level task execution and overall
timing control. The 24 slave processors are used
for parallel high-speed data collection and
compression.


INTRODUCTION


In today's business environment, the goal of
ensuring product quality is becoming more
challenging and important. During the initial
acceptance test or subsequent reissues of a
computerized product, it is often necessary to
characterize or verify the performance of the
system under heavy load. However, as
semiconductor technology advances, systems are
getting faster, more powerful and harder to test.
Very often, it is quite difficult to generate
enough load to test the performance of a system.
Manual load testing has become a thing of the past
and automated testing has become a must.


The DIMENSION® AIS™/System 85ᵃ is the latest
product of the DIMENSION® family. It includes a
sophisticated digital switch capable of providing
both Voice and Data Management services. In order
to verify the operation of the switch under heavy
usage, a load-generation environment is required.
The Multi-Micro System is a part of this
environment - it is designed for load and system
testing of telephony features associated with
electronic voice terminals. To simulate the
actual operating environment, the system is
connected to the switch at places where electronic
voice terminals are normally connected.


The system performs two main functions: 1)
simulates heavy user activities on 96 electronic
voice terminals and 2) verifies the responses. It
implements these functions by performing the
following low-level I/O operations: Voice terminal
refresh data in the form of bit pulses is
collected from 96 data channels (at 63Kb/s burst
rate). This data is then collated, compressed, and
stored into a data base for query. In addition,
bit pulses simulating user activities must be
transmitted back to the 96 data channels at the
same time but at half the input data rate.


As part of the design objective, it is desired
that the system be fairly compact, modularly
structured, and easily expandable. The Multi-Micro
System satisfies these requirements. It employs a
distributed-intelligence multiprocessor
architecture with a total of 25 processors in the
system. There is one host processor responsible
for high-level task execution and overall timing
control. In addition, there are 24 slave
processors responsible for parallel high-speed
data collection and compression.


a. ® Registered Trademark of AT&T. ™ Trademark
of AT&T.


SYSTEM ORGANIZATION 

The reason for choosing a multiprocessor
organization is simply necessity. First, due to
the complexity of the data compression operation,
it is obvious that random logic cannot be used to
implement the system and keep the size reasonably
small. Second, the use of a single microprocessor
is deemed inadequate because most microprocessors
do not have the throughput to handle such a large
amount of data. Finally, the use of a single
microprocessor to interface with each of the
channels would be excessive. Based on these
limitations, the choice of using a microprocessor
to interface with four data channels is a
reasonable compromise - it requires each of the
processors to process a reasonable amount of data,
but not at an unattainable rate.
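
The compromise can be made concrete with the figures given in the paper (96 channels, 63Kb/s burst rate per channel, output at half the input rate, four channels per processor). The sketch below only restates that arithmetic; the function names are hypothetical and the rates are kept in the paper's own "Kb/s" unit.

```python
CHANNELS = 96                 # data channels (from the text)
CHANNELS_PER_SLAVE = 4        # chosen compromise (from the text)
INPUT_BURST_KBPS = 63         # per-channel burst rate, in the paper's "Kb/s"
OUTPUT_RATE_RATIO = 0.5       # output runs at half the input data rate


def slaves_needed(channels=CHANNELS, per_slave=CHANNELS_PER_SLAVE):
    """Number of slave processors for a given channels-per-processor choice."""
    return -(-channels // per_slave)      # ceiling division


def per_slave_rates(per_slave=CHANNELS_PER_SLAVE):
    """Burst input and output rate one slave must sustain, in Kb/s."""
    rate_in = per_slave * INPUT_BURST_KBPS
    return rate_in, rate_in * OUTPUT_RATE_RATIO
```

Four channels per processor gives the 24 slave processors of the design; one channel per processor would need 96 processors, and a single processor would have to absorb all 96 channels at once.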

While there are many different approaches to
building a multiprocessor system (references [1] -
[3] describe several different alternatives), an
interesting hierarchically ordered interconnection
scheme is used here. Figure 1 shows a block
diagram of the system organization and the
following list summarizes some of the important
characteristics of the system:


• The system uses a tightly coupled,
  master-slave type multiprocessor organization.
  There is one Intel 8086 host processor and
  twenty-four Intel 8089 slave processors[4] in
  this system. Each of the processors has a
  private local control store for efficient,
  independent, and parallel operation.


• There are two levels of hierarchically
  ordered busses in this system - one trunk bus
  and two branch busses. These busses are used
  for interprocessor communication only.


• The host processor resides on the trunk bus
  and the slave processors are connected by
  branch busses. A global memory communication
  scheme is used between the host processor and
  the slave processors.


• Each of the slave processors operates two
  direct memory access (DMA) channels
  simultaneously. One reads the input data
  stream while the other transfers data to the
  output. Furthermore, there are four data
  streams multiplexed into each of the DMA
  channels.

With these characteristics in mind, the system
architecture can be described as follows. There
is one host processor in this system and it is
connected to a trunk bus. The trunk bus is in
turn connected to two branch busses via branch bus
gateways. For simplicity, the branch bus uses a
simplified version of the Intel MULTIBUSᵇ
interface. It is a multi-master global bus and
there are 12 slave processors connected to each of
the branch busses. Moreover, either the host
processor or one of the 12 slave processors
residing on that bus can become the bus master.
All data transfers on this bus are 16 bits wide
and it has a worst-case transfer rate of over 800K
words per second.


On each of the branch busses, there is a block of 
16K bytes of global memory. Since this memory can 
be accessed by any of the bus masters, it can be 
accessed by either the host or any of the 12 slave 
processors. To simplify the system hardware, a 
simple software communication scheme is used - the 
host processor is restricted to communicate with 
the slave processors through the global memory 
only. 


b. MULTIBUS is a patented Intel bus.


FIGURE 1. SYSTEM ORGANIZATION AND BUSSING SCHEME


During normal operation, the slave processors
continuously update the data base stored in the
global memory. When the host processor needs
information about the system, it looks up the
proper location in the global memory. Likewise,
when the host processor wants to issue an I/O
command to a slave processor, it writes the
command into specific locations in the global
memory.
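
The global memory communication scheme can be pictured as a set of fixed mailbox locations. The paper does not give the actual memory layout or command codes, so everything specific in the sketch below is an assumption made for illustration: the offsets `COMMAND_BASE` and `STATUS_BASE`, the one-byte command slot and two-byte status word per slave, the use of 0 as "no command", and the function names.

```python
GLOBAL_MEMORY_BYTES = 16 * 1024   # one 16K-byte block per branch bus (from the text)
SLAVES_PER_BRANCH = 12            # slave processors per branch bus (from the text)

COMMAND_BASE = 0x0000   # hypothetical: one command byte per slave
STATUS_BASE = 0x0100    # hypothetical: one 16-bit status word per slave
NO_COMMAND = 0          # hypothetical: empty command slot


def new_global_memory():
    return bytearray(GLOBAL_MEMORY_BYTES)


def _check_slave(slave):
    if not 0 <= slave < SLAVES_PER_BRANCH:
        raise IndexError(f"no slave {slave} on this branch bus")


def host_write_command(memory, slave, opcode):
    """Host side: issue an I/O command by writing it to the slave's slot."""
    _check_slave(slave)
    if not 1 <= opcode <= 255:
        raise ValueError("opcode must be 1..255")
    memory[COMMAND_BASE + slave] = opcode


def slave_take_command(memory, slave):
    """Slave side: fetch and clear the pending command, if any."""
    _check_slave(slave)
    opcode = memory[COMMAND_BASE + slave]
    memory[COMMAND_BASE + slave] = NO_COMMAND
    return opcode


def slave_update_status(memory, slave, value):
    """Slave side: update its entry in the global data base."""
    _check_slave(slave)
    offset = STATUS_BASE + 2 * slave
    memory[offset:offset + 2] = value.to_bytes(2, "little")


def host_read_status(memory, slave):
    """Host side: look up the slave's entry when information is needed."""
    _check_slave(slave)
    offset = STATUS_BASE + 2 * slave
    return int.from_bytes(memory[offset:offset + 2], "little")
```

The sketch ignores bus arbitration: on the real system each of these accesses would be a branch bus cycle by whichever processor currently holds bus mastership.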

This system architecture was chosen for this 
application because it exhibits the following 
characteristics: 


• Information compression - one of the inherent
  characteristics of a hierarchically ordered
  system is the ability to filter out
  unnecessary information before passing it up
  the hierarchy. In this system, the slave
  processors perform this filtering function.
  In fact, the ratio of data read from the
  channels to data written into the global data
  base is approximately 1000 to 1. As a result
  of this information compression, the host
  processor control program is simplified
  because it does not have to perform the time
  critical tasks.

• Efficient communication mechanism - in this
  application, the combination of global memory
  and dual branch busses offers more than
  adequate communication bandwidth and requires
  little protocol overhead.


• System modularity - the bussing scheme is
  expandable. In fact, the system is actually
  equipped with four branch busses, of which
  only two are currently used. If even further
  expansion is needed, the implementation can
  be reconfigured into a multi-dimensional
  bussing arrangement interconnecting multiple
  host processors.


SYSTEM HARDWARE 


There are a total of 28 circuit boards in the 
system, including a CPU board, a bus control 
board, two branch bus gateways, and 24 IOP boards. 


HOST PROCESSOR 


The host CPU is an Intel 8086 operating in maximum
mode[4]. It is connected to two different busses -
an internal private bus called the host bus and an
external bus called the trunk bus. To simplify
the system timing control mechanism, to allow
parallel operations, and to provide some data
privacy among the processors, the host CPU
performs all of its instruction fetches and local
data accesses from on-board memories. The host
CPU accesses the trunk bus only when it has to
communicate with the IOPs.


BRANCH BUS GATEWAY 


For each of the branch busses in the system, there
is an associated circuit board called the branch
bus gateway. The gateway board can be divided
into three sections: the trunk bus to branch bus
interface buffers, the branch bus arbitration
circuit, and the global memory.


I/O PROCESSOR BOARD 


The IOP board contains all the circuitry of a
slave processor and its associated peripherals.
The processor section contains an Intel 8089 I/O
Processor (IOP) operating in remote mode[4]. The
IOP can access two different busses - an internal
private IOP bus or the external global branch bus.
Normally, the IOP performs instruction fetches and
DMA operations on its local IOP bus. The branch
bus is used for data base updates and
communicating with the host processor only.


There are several advantages in using the 8089 IOP 
to control the operation of the IOP board. The 
IOP is specifically designed to operate as a slave 
to. the 8086 processor. Also, the IOP has two 
built-in DMA controllers to facilitate high speed 
data transfers. Finally, the IOP can actually 
execute programs residing in either its private 
store or the global memory. During the program 
debugging phase, the machine code for the IOP is 
downloaded through the host processor into the 
global memory. The IOP program is transferred 
into EPROMs only after it has been stabilized. 
This feature is quite useful for the program debug 
procedure.


Actually, the IOP can be considered as two
processors in one. There are two identical
channels inside the IOP and each of the channels
can operate in either programmed mode or DMA mode.
In the programmed mode, a channel operates as
if it were a normal microprocessor running under the
control of a channel program. Moreover, a channel
can enter DMA mode by the execution of a
single instruction. In the DMA mode, the channel
stops the execution of the channel program and
operates as a DMA controller. It can perform
port-to-port, port-to-memory, or memory-to-memory
transfers at high speed.


In the current design, both channels and both
modes are being used. One of the channels is
dedicated to collecting data from the input and
the remaining channel is dedicated to sending data
to the output.


The input data stream has several unique
characteristics. The data arrives in bursts at
25-millisecond intervals and the amount of data in
each burst alternates between two different
formats. For this reason, both of the channels
are designed to operate on 25-millisecond frames.
At the beginning of each frame, both channels
operate in DMA mode: one channel collects data
into its local memory while the other channel
sends data stored in local memory to the output
circuits.

When a sufficient amount of data has been
collected (usually between 5 and 20 milliseconds
into each frame), the system hardware generates an
interrupt signal to force the IOP to leave the DMA
mode. The IOP then enters the programmed
mode, runs under the control of the channel
program, performs the collating and compression
operations, updates the global data base, executes
I/O commands from the host processor, and reenters
the DMA mode.
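
The frame cycle described above can be sketched as a small timeline function. This is only an illustration, not the authors' channel program: the names `Mode`, `FRAME_MS` and `frame_timeline`, and the assumption that the channel program finishes before the frame ends, are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    DMA = "dma"
    PROGRAMMED = "programmed"


FRAME_MS = 25  # frame length stated in the text


def frame_timeline(interrupt_ms, program_ms):
    """Return (start_ms, end_ms, mode) segments for one 25 ms frame.

    interrupt_ms: time at which the hardware interrupt ends DMA mode
                  (hypothetical input; the text says usually 5 to 20 ms).
    program_ms:   time the channel program needs (hypothetical input).
    """
    if not 0 < interrupt_ms < FRAME_MS:
        raise ValueError("interrupt must fall inside the frame")
    program_end = interrupt_ms + program_ms
    if program_end > FRAME_MS:
        # Assumption: the channel program must finish within the frame.
        raise ValueError("channel program overruns the frame")
    segments = [
        (0, interrupt_ms, Mode.DMA),                   # collect / send data
        (interrupt_ms, program_end, Mode.PROGRAMMED),  # collate, compress, update
    ]
    if program_end < FRAME_MS:
        segments.append((program_end, FRAME_MS, Mode.DMA))  # back in DMA mode
    return segments
```

For example, `frame_timeline(5, 8)` yields a DMA segment, a programmed segment from 5 ms to 13 ms, and a final DMA segment up to the 25 ms frame boundary.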


SYSTEM SOFTWARE 


An important part of this system is the global
memory communication structure. The host
processor can communicate with the IOPs only
through these software structures. As with all
multi-processor systems that use global memory for
interprocessor communication, it is important to
maintain process synchronization and data privacy
between the various processors. In this system,
privacy is ensured by having a separate local and
global memory.


The global memory consists of three major data
communication structures: a system status database
storage, a host-to-IOP communication interface,
and an IOP-to-host processor interface. To assure
data synchronization between the host processor
and the IOP, several simple but effective
techniques are employed. For example, some of the
data structures are restricted to be
unidirectional. Other techniques employed
include semaphore constructs and event counters.
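
As an illustration of how a unidirectional structure and event counters can work together, the following minimal sketch models a one-way mailbox in shared memory. It is not the paper's actual data layout: the class `OneWayMailbox`, its field names and the ring-buffer organization are assumptions made for the example. Each counter has exactly one writer, which is what makes the structure safe without a lock in this single-sender, single-receiver setting.

```python
class OneWayMailbox:
    """Single-writer, single-reader ring buffer (hypothetical layout)."""

    def __init__(self, slots):
        self.buf = [None] * slots
        self.written = 0  # event counter advanced only by the sender
        self.read = 0     # event counter advanced only by the receiver

    def try_send(self, msg):
        """Sender side: returns False when every slot is still unread."""
        if self.written - self.read == len(self.buf):
            return False
        self.buf[self.written % len(self.buf)] = msg
        self.written += 1  # publish only after the slot has been filled
        return True

    def try_receive(self):
        """Receiver side: returns None when nothing new has been published."""
        if self.read == self.written:
            return None
        msg = self.buf[self.read % len(self.buf)]
        self.read += 1     # release the slot back to the sender
        return msg
```

One such mailbox per direction (host-to-IOP and IOP-to-host) would keep each structure unidirectional, in the spirit of the restriction mentioned above.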


The source program for the 8086 CPU was developed
on a UNIX™-based VAX 11/780 minicomputer. Except
for the I/O drivers, most of the routines are
written in the "C" programming language[5]. The
system is written to be task-oriented. In the
current implementation a task can be in one of
five states: it can be waiting for execution, in
execution, sleeping, waiting for a response from
the IOP, or waiting on the output device.

™ UNIX is a trademark of Bell Laboratories.


To minimize execution time and program size, the
program for the 8089 IOP is written in assembly
language. It consists of two separate control
programs, one for each of the channels of an IOP.
Although these two channels can operate
independently, the DMA input and output transfers
are always kept in synchronization by signals from
the external hardware. As mentioned earlier, both
control programs are written to run on 25-millisecond
frames. Moreover, some of the processing, such as the
compressing function, is spread over several frames.
This approach reduces the processing time within each
individual frame and enables the IOP to satisfy
real-time requirements.


SUMMARY AND CONCLUSIONS 


The Multi-Micro System is currently being used for
in-house system testing. We believe that the
architecture described in this paper is well
suited for real-time critical and I/O intensive
applications. The use of private local storage in
each processor and multiple global busses in the
system is essential for efficient, independent,
parallel operations. Additionally, the highly
modular organization allows the users to add or 
delete test resources easily. The built-in DMA 
controller feature of the slave processor enhances 
the throughput of the system significantly and 
enables the designers to implement such a compact 
tool. 


In the future, more fault-tolerant features should
be added to the system. For example, overall
system reliability can be further increased by
adding a redundant host processor. Moreover, the
host processor software should be expanded to
include periodic maintenance routines to audit the
health of the IOP boards and restart those that have
failed.

PIPELINE AND PARALLEL ARCHITECTURES FOR COMPUTER COMMUNICATION SYSTEMS

ABSTRACT


Various existing communication processor
systems (CPSs) at different nodes in
computer communication systems (CCSs) are
reviewed for distributed processing systems.
To meet the increasing load of messages,
pipeline and parallel architectures are
suggested in CPSs. Finally, pipeline, array,
multi and multiple-processor architectures
and their advantages in CPSs for CCSs are
presented and analysed, and their performances
are compared with the performance of a
uniprocessor architecture.


INTRODUCTION 


Over the past few years, we have seen
the emergence of a new generation of computer
technology. This has increased the availability
of low-cost mini/microcomputers, intelligent
terminals, etc., and has greatly reduced the
cost per unit of computing power. This has led
to a tremendous increase in the number of
computer users and in their requirement for
enormous computing power to assist in day-to-day
human life and to solve a number of problems in
several areas of science and technology. The best
way to meet all the requirements of the present
user population is through Distributed
Processing Systems (DPSs).


Distributed Processing Systems 


These DPSs bring the computing power
nearer to the computer users irrespective
of their geographical distribution. Unlike
centralized computer systems, these systems
are less expensive, and they increase
processing power and shorten instruction
execution time with the addition of new
computer users/terminals. The main
principle of DPSs is to do the processing
where the data is. This leads to a
reduction of the data communication traffic,
since data processing usually involves
data reduction. These DPSs increase
the redundancy, fault-tolerance, reliability
and sharing of expensive resources.
They are more flexible, versatile, dynamic,
reconfigurable and expandable, encourage
construction of dedicated systems,
distributed databases and multi-microprocessor
systems, give more throughput due to
inherent pipelining and parallelism,
and require less incremental cost than
centralized systems. They further provide
remote access to a variety of resources,
access to databases, a facility for
exchanging personal messages, etc. In
these systems with distributed control
it is more natural to communicate through
messages rather than through synchronized
signals. With proper autonomy and storage
in the message transport system, processors
do not communicate in future systems with
many processors.


Computer Communication Systems 


Given the advantages of DPSs, the
demand for DPSs is growing very fast
along with the computer user population.
This trend increases the number of processing
nodes as well as the complexity of
DPSs. At the same time the exchange of
messages between nodes is also increasing.
To streamline the passage of messages
between processing nodes, the requirement
for advanced computer communication systems
(CCSs) is also increasing.


With the recent advances in computer
communications, protocols, computer networks,
communication software, operating
systems for networks, etc., many CCSs
like ARPANET, ALOHA, ETHERNET, CYCLADES,
etc. have come into operation and they
are becoming very popular with their use
in DPSs. With the present advances in
CCSs, the development of DPSs is also
progressing.


As the computer user population and
the demand for DPSs are increasing, the
demand for high throughput CCSs is also
increasing. New techniques like
multiprocessing [1] and parallel processing [2,3]
have been proposed to improve the throughput
of CCSs and provide service facilities
to all the computer users. At the same
time the demand for high throughput
communication processor systems (CPSs)
[4,5] is also increasing to meet the
demand for processing all the messages, at
any node, within reasonable time. Recent
advances in computer architectures, LSI,
VLSI and VHSI have caused significant changes
in the area of computer architecture.
These new developments are encouraging the
growth of high throughput and reliable
CPSs for CCSs in DPSs [5]. Many CPSs
capable of performing a wide range of
communication applications including
circuit switching and store-and-forward
switching or a combination of these two
in CCSs are in operation [5].


COMMUNICATION PROCESSOR SYSTEMS

Bell Labs have developed many systems
[6-9] for stored program controlled
switching. GTE Sylvania, Inc. has
proposed a CPS for a combination of circuit
and store-and-forward switching [5]. A
multiprocessor, ETS-4 [10], for circuit
switching and another multiprocessor, C.mmp
[11,12], as a switching node of the ARPA
network for a combination of circuit and
store-and-forward switching have been developed.
Applying associative and parallel processing
techniques, the Honeywell research
group [13] proposed an architecture for a
combination of circuit and store-and-forward
switching to implement demultiplexer
and multiplexer functions. The North
Electric Company has proposed a CPS [14]
for distributed processor systems. The
Pluribus system [15,16] is used as a modular
switching node for the ARPA network.


Looking at the evolution of the CPS
architecture, we can find that the trend
has been from uniprocessor architecture to
multiple-processor architecture. As the
communication requirements continuously
increase, the complexity of CPSs will keep
growing and more functional modules will
be included in the system [17-19].


PROCESSOR ARCHITECTURES


At every node in a CCS, the CPS receives
the messages, stores them in its memory,
and routes them toward their next destination so
that they arrive within the shortest possible time.
Before routing, the CPS performs operations
like frame decomposition, control packet
processing synchronization, field processing,
control packet formulating, receive
and transmit tables updating, and frame
composition, on each message. If a single
processor is used to do all these operations,
it will take a longer time to process all
the incoming messages. The processing time
can be reduced by using more than one processor
arranged in pipeline and parallel
processor architectures. Even though,
theoretically, an n-processor system has a
throughput equal to n times that of a
uniprocessor, practical limitations always
keep it below the theoretical value. CPSs
having pipeline and parallel architectures
handle very high loads and have a
very high throughput rate. In this
paper, some resource sharing architectures
having n processors to increase the
throughput of a CPS are proposed and their
performances relative to a uniprocessor are
analysed.


Uniprocessor 


A uniprocessor CPS performs all the
operations on the messages in a sequential
manner. For example, if the CPS receives
m messages, each message requires n steps
to complete the execution of all operations,
and each step requires an average
execution time of $t_s$ seconds, then the time
required to process all the m messages
on a uniprocessor is

$t_u = m\,n\,t_s$ sec. (1)


If different operations on each message,
or different messages, are simultaneously
executed on different processors, then
the overall time required to process all
the messages will be reduced. This increases
the speed of processing and the
throughput. To achieve this goal, the
following pipeline, array, multi and
multiple-processor architecture schemes,
which are better suited to executing different
operations or messages simultaneously on a
number of processors, are proposed to increase
the overall throughput of CPSs.


Pipeline Processor 


In this scheme all the operations to
be performed on each message are partitioned
into n steps of equal average execution
time $t_s$, and all the steps are executed
in n different processors on n different
messages simultaneously. All the n
processors are arranged in series, like a
pipeline, and each message passes through all
the n processors one after the other in
$n\,t_s$ seconds, completing the execution of
all the n steps of operations on all n
processors. The total time required to
process m messages in a pipeline processor
architecture having n processors is

$t_p = (m+n)\,t_s$ sec. $\approx m\,t_s$ sec. for $m \gg n$ (2)

Comparing equations (1) and (2), one can
see that the pipeline processor requires
only 1/n of the time required by the
uniprocessor. The average message process
delay, in this scheme, is equal to 1/n
of the message process delay caused by a
uniprocessor and its throughput will be n
times that of a uniprocessor. Since the
pipeline scheme has n processors and all
are connected in series, a processor failure
will affect the whole scheme and so
its reliability is much less than that of a
uniprocessor. Higher reliability and
throughput can be achieved by applying the
multiprocessor technique.
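
Equations (1) and (2) can be checked numerically with a few lines of code. This is a minimal sketch that takes the two formulas exactly as printed; the function names and the symbol `t_s` for the per-step time are assumptions introduced for the example.

```python
def uniprocessor_time(m, n, t_s):
    """Equation (1): m messages, n sequential steps each, t_s seconds per step."""
    return m * n * t_s


def pipeline_time(m, n, t_s):
    """Equation (2) as printed: (m + n) * t_s for an n-stage pipeline."""
    return (m + n) * t_s


def speedup(m, n, t_s=1.0):
    """Ratio of uniprocessor time to pipeline time; tends to n when m >> n."""
    return uniprocessor_time(m, n, t_s) / pipeline_time(m, n, t_s)
```

With n = 8 stages, `speedup(10, 8)` is about 4.4 while `speedup(1000, 8)` is about 7.9, which shows why the factor-of-n claim depends on the condition m >> n.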


Multiprocessor 


In this scheme, all the operations to
be performed on each message are partitioned
into independent steps and all the
possible independent parallel steps are
simultaneously executed in n different
processors on the same message. Since
the execution times of the steps are
not equal, complicated scheduling is required
to get optimum performance from all
the processors. In this case, throughput
will be nearly equal to n times that of a
uniprocessor, slightly less than that of a
pipeline processor. This is because the
full capacity of the n processors is not
utilized due to practical limitations. In this
case the failure of one processor will not
affect the functioning of other processors.
Hence, its reliability is much better than
that of pipeline and uniprocessors. In pipeline
and multiprocessors, the expandability cost will
be higher. A suitable architecture for
fast increasing loads with lower additional
cost for expandability is the array processor.


Array Processor 

In this scheme all the n processors are arranged
in the form of an array and n different
messages are processed in n different
processors simultaneously. An instruction is
executed simultaneously in all the n processors
on all the n messages present in
them. The throughput of an array processor
will be equal to n times that of a
uniprocessor. In this case the failure of any one
processor will not affect the functioning
of other processors. This increases the
reliability, and it is nearly equal to that
of a multiprocessor. Unlike multiprocessors,
in this case the addition of further processors
is easier and it does not take
complicated scheduling to share all the
resources optimally. The main drawback
of this scheme is that, at any time, two
or more processors can try to access the same
resource. This creates additional operating
system design problems. This can be
overcome by using the multiple-processor
technique.


Multiple-processor


In this scheme all the n processors
share all the resources and operate
independently. At any time, n messages are
processed independently on n processors
and whenever two or more processors try to
access a single resource, all these
requests are put in a queue and they are
served under the FCFS queue discipline. This
delays only those messages whose processor
requests are in the waiting queue and all
the other processors continue execution
without waiting. Due to this fact, the
throughput of a multiple-processor will be
slightly less than that of a pipeline
processor, but its reliability and expandability
are greater than the others. The
average message processing delay, in this
case, is nearly equal to the delay
caused in the case of pipeline and array
processors.
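
The FCFS handling of contending requests can be illustrated with a short simulation. This is only a sketch under simplifying assumptions that are not in the paper: a single shared resource, at most one outstanding request per processor, and known service times. The function name `fcfs_schedule` is hypothetical.

```python
def fcfs_schedule(requests):
    """Serve requests for one shared resource in arrival order.

    requests: list of (arrival_time, processor_id, service_time).
    Returns {processor_id: waiting_time}. Only processors whose request
    arrives while the resource is busy are delayed.
    """
    waits = {}
    free_at = 0.0  # time at which the shared resource next becomes free
    for arrival, proc, service in sorted(requests, key=lambda r: r[0]):
        start = max(arrival, free_at)  # queue behind earlier requests
        waits[proc] = start - arrival
        free_at = start + service
    return waits
```

In `fcfs_schedule([(0, "P1", 3), (1, "P2", 2), (10, "P3", 1)])`, only P2 waits (2 time units), matching the observation that processors not in the queue proceed without delay.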


CONCLUSION 


Out of all the above schemes, the pipeline
scheme has the highest throughput and the least
reliability. If highly reliable pipeline
processors are available then this scheme is
very suitable where the throughput of the CCS is
almost steady and failure of the CPS will not
cost much. On the other hand, array,
multi and multiple-processors have higher
reliability and expandability than the pipeline
scheme and their throughput is also
nearly equal to that of a pipeline processor
scheme. If one considers throughput, reliability
and expandability, then the most
suitable scheme will be the multiple-processor
architecture.

ABSTRACT

This paper describes the architecture of a
multiprocessing system for voice/data switching.
The system consists of a large number of units
called "Processing Elements," which have specialized
and dedicated functions. Due to their parallel
operation, a high degree of concurrency can be
achieved.

The architecture of the system is described.
Simulation results pertaining to memory access
times are presented. The operating system is
described. Certain analytical results pertaining
to blocking probabilities of packets, and utilization
of the Processing Elements are discussed.

INTRODUCTION 

The last decade has witnessed a tremendous
growth in the amount of information that is
transmitted digitally. Besides the digital traffic
generated as a result of computer communication,
applications like digitized voice, electronic
mail, etc. have further intensified the need for
efficient transmission of digital data. Integrating
these various kinds of data onto a single
channel is an attractive solution for an Integrated
Services Digital Network due to the greater
channel utilization that is thus accrued [1].
However, due to the variety of switching schemes
that are found to be efficient for different types
of data [2], and due to the differences in protocols
used by different hosts, a fairly sophisticated
switching and routing center is required.
Advances in VLSI have made it feasible to realize
systems with distributed architectures, making
integrated switching schemes economically viable
[3],[4].

SYSTEM ARCHITECTURE

It is desired to have an Interface Message
Processor capable of handling data generated by
sources of different kinds. These different kinds
of data need to be integrated in a way that would
optimize certain system parameters (cost, for
instance) subject to certain system constraints,
such as delay, error rate, etc. To achieve high
throughput, and to permit many different switching
strategies to be implemented concurrently, a
multiprocessing architecture is desirable [4].


The system that we describe has a large number
of units called "Processing Elements" (PE's). A
set of PE's forms a "cluster." There are 25
"clusters" in the system. Each "cluster" consists
of a PE called the "Circuit Switching Element"
(CSE) and four other PE's called the "Packet
Processing Elements" (PPE's). These five PE's in a
"cluster" are interconnected over a bus, called
the intra-cluster bus. Each "cluster" has a local
memory called the "cluster memory," which it uses
for storing instructions and data. Each "cluster"
exercises direct control over one Switching
Matrix. The CSE is responsible for controlling
the switching operations in the Switching Matrix,
whereas the PPE's are in charge of buffering and
transmitting packets.

0190-3918/83/0000/0151$01.00 © 1983 IEEE

The "clusters" are again connected, via a 
global bus, to three memory banks (shared memory) 
and a CPU (Fig. 1). Two devices, called the Bus 
and Interrupt Controller, and a Bus Multiplexor, 
are used for efficient management of the global 
bus and the memory banks respectively. The PE's 
in a cluster use a token ring protocol to govern 
the use of the intra-cluster bus. 


The CPU acts primarily to disburse resources
for the different tasks in an orderly manner. It
constructs the "Switch Control Words," which are
executed by the CSE's. These are similar to the
Channel Commands in a general purpose machine.
The Switch Control Words are prefetched by the CSE
into its cluster memory, which, thus, acts as a
cache memory for the CSE. In executing the Switch
Control Words, the CSE connects one Line to
another (circuit switching) or a Line to a Packet
Processing Element (when packets need to be
buffered). The number of Lines is more than the
number of Packet Processing Elements. This is due to
the fact that sometimes, certain kinds of data may
be circuit switched. A set of Switch Control
Words forms a "Switching Program," which, when
executed, implements a specific switching scheme.
The CSE also determines the manner in which packets
would be received by the PPE, and so on.


PERFORMANCE ANALYSIS 


The system was simulated using GPSS-V. Spe- 
cifically, it was desired to estimate the time 
required for an access from the shared memory. It 


was observed that, for a memory with a cycle time 
of 250 ns., the CPU, operating in a high-priority 
mode, was able to access the memory within 300 ns. 
(?7% of the time, and the Processing Elements only 
33% of the time [4]. 


Certain analytical results pertaining to the
probability of blocking were derived. For the
purpose of analysis, it was assumed that a switching
matrix had N Lines, and the cluster that controlled
it had M PPE's. It was further assumed
that the traffic could be divided into two broad
categories, viz., circuit-switched traffic (henceforth
referred to as voice-streams) or packet-switched
traffic (henceforth referred to as data
packets). Fig. 2 shows the blocking probabilities
for circuit-switched traffic (voice-streams) and
packet-switched traffic (data packets), as a function
of the number of PPE's per cluster. It is
observed that as the number of Lines present
increases, the blocking probability of a data
packet increases. This is because as the number
of Lines increases, the blocking probability of
voice-streams (i.e. circuit-switched traffic)
decreases, and this makes the Lines less available
for data packets. Blocking could occur for both
kinds of traffic. Fig. 3 shows the average number
of PPE's in use as a function of the number present
in a cluster. The ratio of the average number
of PPE's in use and the number of PPE's present is
the utilization of the PPE's. It is observed that
as the number of Lines and the amount of data
traffic increase, the average number of PPE's
utilized increases, and approaches the ideal limit
depicted as a straight line in the figure. This
situation corresponds to 100% utilization of the
PPE's. For other cases, the average number of
PPE's utilized increases with the number of PPE's
present, but only up to a certain point.

SOFTWARE 

A group of users who wish to communicate are 
normally allocated Lines in one Switching Matrix. 
This is to reduce the overhead of moving packets 
from one cluster memory into another. The operat- 
ing system creates a series of processes for each 
set of users that use a particular protocol, or 
constitute a group whose members communicate pri- 
marily with other members within the group. Dif- 
ferent system utilities serve to implement differ- 
ent protocols, which are executed to create a 
particular pattern of "Switch Control Words" necessary
to implement a particular protocol.


A routine called the Line Allocation Module
performs Line allocation. Users that need to
communicate frequently are allocated contiguous Lines
in the same switching matrix (to the extent possible).
Allocation of contiguous Lines to similar
users and the implementation of a specific protocol
by the CSE creates the effect of a dedicated
system.

Another important module called the "Memory
Management Module" is in charge of allocating/deallocating
the shared memory. It uses the Partitioned
memory management scheme, and the best-fit
algorithm. It allocates the memory so as to balance
the occupancy in each bank, thus increasing
concurrency in memory accesses.
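
One way to combine best-fit allocation with bank balancing is sketched below. This is not the paper's Memory Management Module: the `Bank` class, the free-list representation, the splitting of a partition on allocation, and the "least occupied bank first" rule are all assumptions chosen to illustrate the idea.

```python
class Bank:
    """One shared-memory bank with a list of free partitions (hypothetical)."""

    def __init__(self, size):
        self.size = size
        self.free = [(0, size)]  # (start, length) of each free partition

    @property
    def used(self):
        return self.size - sum(length for _, length in self.free)

    def best_fit(self, request):
        """Index of the smallest free partition that fits, or None."""
        fits = [(length, i) for i, (_, length) in enumerate(self.free)
                if length >= request]
        return min(fits)[1] if fits else None

    def allocate(self, request):
        """Carve `request` units out of the best-fitting partition."""
        i = self.best_fit(request)
        if i is None:
            return None
        start, length = self.free.pop(i)
        if length > request:  # keep the unused remainder on the free list
            self.free.insert(i, (start + request, length - request))
        return start


def allocate_balanced(banks, request):
    """Try banks from least to most occupied; best-fit inside each bank."""
    for idx in sorted(range(len(banks)), key=lambda k: banks[k].used):
        start = banks[idx].allocate(request)
        if start is not None:
            return idx, start
    return None
```

With three empty banks, successive requests land in different banks, so later accesses to those blocks can proceed concurrently. Deallocation and coalescing of free partitions are omitted from the sketch.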


CONCLUSIONS 

This paper discussed the architecture of a
multiprocessing system for circuit/packet switching.
Some of the simulation results were also
discussed. An overview of a part of the necessary
switching software was presented.


APPENDIX

We address the problem of determining the PPE
utilization. Specifically, if a Switching Matrix
has N Lines and the "cluster" that controls it has
M Packet Processing Elements (M < N), what is the
average number of PPE's being utilized at any
instant? We also attempt to find the probability
of a data packet or a voice-stream being blocked.
For the purpose of analysis, a stream that needs
to be circuit switched could also be viewed as a
packet to be transmitted instantaneously (without
buffering and without engaging a PPE). Assume
that the arriving traffic can be divided into two
broad categories, i.e., "packets" that need to be
circuit switched (voice-streams) or those which
need to be buffered (data packets). A voice-stream,
hence, needs two Lines, but a data packet
requires a Line and a PPE. We further assume that
the arrival pattern is Poisson, with a mean arrival
rate of $\lambda_v$ for voice-streams, and a mean
arrival rate of $\lambda_d$ for data packets. Each packet
engages the Line(s) or the PPE for the length of
time determined by the bit rate, and its size.
This is the service time, and its mean is assumed to be
$1/\mu_v$ for voice-streams, and $1/\mu_d$ for data packets.
The parameters of interest are r, the number of
Lines that are tied up at any instant, and s, the
number of PPE's engaged in service. This represents
the situation when s data packets and
(r-s)/2 voice-streams are being serviced.


Let p(n,q) be the probability of a cluster
being in a state (n,q), which corresponds to 'n'
voice-streams, and 'q' data packets currently
being serviced by the cluster. This corresponds
to the case of r = 2n + q and s = q. These
states are depicted in the Markov-chain in Fig. 4,
for the case of N = 8, and M = 4. A transition
can occur from the state (n,q) to the state (n+1,
q) with the probability $\lambda_v\,dt$ in a time interval
dt, provided (n+1, q) is an accessible state.
This corresponds to an arrival of a voice-stream.
Similarly, a transition could occur from the state
(n,q) to the state (n-1, q) with a probability of
$n\mu_v\,dt$. This is because the cluster might complete
servicing any one of the n voice-streams. A
similar argument holds for data packets. Hence,
in Fig. 4, voice-streams cause transitions along
the columns, and data packets cause transitions
along the rows. Each of the accessible states
(n,q) satisfies the following condition:

$0 \le q \le M$, and $0 \le 2n + q \le N$ (1)
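
The state space defined by condition (1) and the transition rates described above can be explored numerically. The sketch below enumerates the accessible states and builds a candidate stationary distribution in the standard product form for such a truncated birth-death chain. That product form is an assumption made here, not necessarily the expression the authors printed; the code checks it against the global balance equations rather than taking it on trust. The blocking conditions follow the text (a voice-stream needs two Lines, a data packet needs a Line and a PPE), and all function names are hypothetical.

```python
from math import factorial


def accessible_states(N, M):
    """All (n, q) with 0 <= q <= M and 0 <= 2n + q <= N."""
    return [(n, q) for q in range(M + 1) for n in range((N - q) // 2 + 1)]


def product_form(N, M, lam_v, mu_v, lam_d, mu_d):
    """Candidate stationary distribution (assumed product form)."""
    a_v, a_d = lam_v / mu_v, lam_d / mu_d
    w = {(n, q): a_v ** n / factorial(n) * a_d ** q / factorial(q)
         for n, q in accessible_states(N, M)}
    total = sum(w.values())
    return {s: x / total for s, x in w.items()}


def balance_residual(p, lam_v, mu_v, lam_d, mu_d):
    """Largest |flow out - flow in| over all states; near 0 if p is stationary."""
    worst = 0.0
    for (n, q), prob in p.items():
        out = inflow = 0.0
        if (n + 1, q) in p:  # voice arrival out, voice departure in
            out += lam_v * prob
            inflow += (n + 1) * mu_v * p[(n + 1, q)]
        if (n - 1, q) in p:  # voice departure out, voice arrival in
            out += n * mu_v * prob
            inflow += lam_v * p[(n - 1, q)]
        if (n, q + 1) in p:  # data arrival out, data departure in
            out += lam_d * prob
            inflow += (q + 1) * mu_d * p[(n, q + 1)]
        if (n, q - 1) in p:  # data departure out, data arrival in
            out += q * mu_d * prob
            inflow += lam_d * p[(n, q - 1)]
        worst = max(worst, abs(out - inflow))
    return worst


def voice_blocking(p, N):
    """Probability that fewer than two Lines are free."""
    return sum(x for (n, q), x in p.items() if 2 * n + q + 2 > N)


def data_blocking(p, N, M):
    """Probability that no PPE or no Line is free."""
    return sum(x for (n, q), x in p.items() if q + 1 > M or 2 * n + q + 1 > N)


def mean_ppes_in_use(p):
    return sum(q * x for (_, q), x in p.items())
```

For N = 8 and M = 4, `accessible_states` returns 19 states. Because arrivals are Poisson, the blocking probabilities are taken as the total probability of the states in which an arrival cannot be admitted.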


Solving for the state probabilities yields the
following equation for p(n,q), the state probabilities
for states (n,q) that satisfy (1):

A CLASS OF GRAPHS FOR PROCESSOR INTERCONNECTION


S.M. Reddy, P. Raghavan and J.G. Kuhl


Abstract -- Two classes of graphs with $N = 2^n$
nodes, diameter $\log_2 N$, $\approx N$ or $< 1.5N$ links and
node connectivity of 2 are presented. These
graphs appear to be suitable for interconnection
applications in computing networks. Even though
nodes in the graphs have different degrees (at
most n+1 different degrees), the graphs have
structured interconnection patterns. The
maximum degree of the nodes in the graphs is n.


I. Introduction 


Several researchers have studied the 
problems of network topology for computing 
networks based on the graph model [1-13]. Some 
of the properties of graphs that are important 
to this application are diameter, node and link 
connectivity, modularity, maximum degree of a 
node, the number of links, routing algorithms 
and the effect on diameter when a node(s) or 
link(s) is removed. Most of the current 
research in network topology has dealt with 
regular graphs (graphs in which the degrees of 
all nodes are equal) or “essentially regular” 
graphs [1-13]. In this paper we present two
classes of graphs with $2^n$ nodes, diameter n and
node connectivity of 2. These graphs have nodes
with varying degrees (2 through n). In one
class of graphs, on average, there are fewer
than 1.5 links per node and in the other class
of graphs, on average, there is one link per
node.


A brief definition of the terms used
follows. The distance between two nodes, say i
and j, of a graph is the number of links in a
shortest path between i and j, and the diameter
of a graph G is the maximum of the distances
between pairs of nodes of the graph. The node
(link) connectivity of a graph G is the minimum
number of nodes (links) that have to be removed
for G to become disconnected or reduce to a
single node. It is known that the node
connectivity of a graph is less than or equal to
its link connectivity. The t-node-deleted
(t-link-deleted) diameter of a graph G is the
maximum of the diameters of the subgraphs of G
that are obtained by removing some t nodes (t
links) [16]. The size of a graph is the number
of links (edges) in it.


The diameter of a graph G gives a measure 
of the worst case delay in a message switched 
network and the connectivity of G gives a 
measure of the fault-tolerance of the network 
modeled by G. 


II. Graph Construction


The degree of a node in a graph G gives the 
number of communication links (physical or 
logical) incident on a node. Clearly it is 
important to keep the number of communication 
links small to achieve the desired diameter and 
connectivity. Most researchers have 
concentrated on the construction of regular 
graphs with small degrees. One of the most 
widely studied class of regular graphs is for 
degree 3 [3-5,7-9,11]. A simple lower bound by 
Moore [14] shows that the diameter of degree 3 
regular graphs of N nodes should be * logoN. 

The best known construction achieves a diameter 
of 1.472 log,N [8,ll]. Note that the size of a 
N node regular graph of degree 3 is 1.5N. A 
binary n-cube, which has also been proposed as a 
possible topology for computing networks [13], 
is a regular graph of degree n with 2" nodes and 
node connectivity of n. However the size of the 
binary n-cube is n¥2™ *. Another class of 
popular graphs for computing networks is loops 
[15]. These graphs are regular with degree 2, N 
links and have node connectivity of 2. The 
diameter of the loops is however N/2. The 
popularity of loops stems from the simple 
routing, modularity (i.e. nodes can be added and 
deleted without changing much of the network 
connections) and tolerance to single faults 
[15]. For the number of links, maximum degree 
of a node, diameter and node connectivity it can 
be seen that the proposed graphs fall between 
loops and binary n-cubes. We summarize this 
comparison in Table I. Before we present the 
details of the construction it is important to 
consider reasons to construct graphs presented 
here. 
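
As a rough illustration of the two reference points in this comparison, the following Python sketch tabulates the closed-form number of links, degree, diameter and node connectivity of a loop and of a binary n-cube. The function names are ours and purely illustrative; the values are the ones quoted in the text above.

```python
def loop_properties(num_nodes):
    """Closed-form properties of a loop (ring) on num_nodes >= 3 nodes."""
    return {
        "links": num_nodes,
        "max_degree": 2,
        "diameter": num_nodes // 2,
        "node_connectivity": 2,
    }


def n_cube_properties(n):
    """Closed-form properties of a binary n-cube on 2**n nodes."""
    return {
        "links": n * 2 ** (n - 1),
        "max_degree": n,
        "diameter": n,
        "node_connectivity": n,
    }


if __name__ == "__main__":
    for n in (3, 5, 10):
        print(n, loop_properties(2 ** n), n_cube_properties(n))
```

For N = 32 the loop needs 32 links but has diameter 16, while the 5-cube has diameter 5 at the cost of 80 links; the proposed graphs are meant to sit between these extremes.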


If one wants to construct graphs of small 
diameter and minimum number of links, one 
excellent solution is the graph shown in Figure 
1. The diameter of this graph is 2 and its size is
N-1. However the node connectivity of this
graph is 1 and hence the modeled computing
network has a single point of failure. A graph
with node connectivity 2 would require that the 
degree of each node be at least 2. If it is 
exactly 2, the graph would be a loop, with O(N) 
diameter. Hence to achieve smaller diameters 
one must allow the minimum degree of the nodes 
to be greater than 2. However the maximum
degree of nodes and the total number of links
should be kept low, to keep the complexity of
the individual nodes and the complexity of the
interconnections reasonable. Systematic methods
to trade off maximum degree, size, diameter and
connectivity of the graphs appear to be
extremely difficult to obtain (this research
topic is called Extremal Graph Theory [14]).
The work presented in this paper can be viewed
as an empirical solution to this problem.
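
The two extremes discussed above can be checked numerically. The sketch below assumes the graph of Figure 1 is a star, since a connected graph with N-1 links and diameter 2 must be one; it builds a star and a loop and measures size and diameter by breadth-first search. All function names are illustrative.

```python
from collections import deque


def make_graph(num_nodes, edges):
    """Adjacency-set representation of an undirected graph."""
    adj = {v: set() for v in range(num_nodes)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def star(num_nodes):
    """Node 0 is the hub; every other node is linked only to it."""
    return make_graph(num_nodes, [(0, v) for v in range(1, num_nodes)])


def loop(num_nodes):
    """Each node v is linked to v+1 (mod num_nodes)."""
    return make_graph(num_nodes, [(v, (v + 1) % num_nodes) for v in range(num_nodes)])


def size(adj):
    """Number of links; each link is stored at both endpoints."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2


def diameter(adj):
    """Largest shortest-path distance, found by BFS from every node."""
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best
```

With 16 nodes the star has 15 links and diameter 2 but fails when its hub fails, whereas the loop has 16 links, tolerates one node failure, and has diameter 8.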

2.1 Class I Graphs 

Next we give a class of graphs with N = $2^n$ nodes
and size less than 1.5 N. We call this class of
graphs Class I Graphs. The construction
procedure we give uses three basic graphs with
diameters 1, 2 and 3, shown in Figure 2. The
graphs with higher diameters are constructed
iteratively using these graphs. Since the
graphs we construct have $2^n$ nodes we can
use a binary n-tuple to label the nodes. The
labels to be used for the first three graphs, 
are shown in Figure 2. 


Let the labels of the nodes of the graph to
be constructed be $x_{n-1} x_{n-2} \ldots x_0$,
$x_i \in \{0, 1\}$. The set of nodes which have
identical values in some k fixed positions
constitutes an (n-k)-cube and is a subcube of the
binary n-cube with $2^{n-k}$ vertices. The subcubes
are denoted by placing dashes (-) in the (n-k)
positions where the values are not fixed. For
example 00--- represents the 3-cube
{00000, 00001, 00010, 00011, 00100,
00101, 00110, 00111}. This notion of the
subcube is useful in following the algorithm
given next.
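
The dash notation for subcubes can be made concrete with a short Python sketch. The helper name `subcube` is ours; it simply expands every dash into both binary values.

```python
from itertools import product


def subcube(pattern):
    """List the node labels of the subcube written as e.g. '00---'."""
    free_positions = [i for i, ch in enumerate(pattern) if ch == "-"]
    labels = []
    for bits in product("01", repeat=len(free_positions)):
        label = list(pattern)
        for position, bit in zip(free_positions, bits):
            label[position] = bit
        labels.append("".join(label))
    return labels


if __name__ == "__main__":
    print(subcube("00---"))
```

A pattern with n-k dashes yields $2^{n-k}$ labels, so `subcube("00---")` lists the eight nodes of the 3-cube in the example above.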


Algorithm Graph Construct:
Let $G_n$ be the graph to be constructed.


1. If n = 1, 2, 3, then $G_n$ is defined by the
graphs in Figure 2. If n > 3, then for
each 3-cube $x_{n-1} x_{n-2} \ldots x_3$--- include the
edges of the graph shown in Figure 2(c).
Note that there will be $2^{n-3}$ subcubes and
the edges in each subcube are placed by
referring to Figure 2(c) and neglecting the
first (n-3) components of the node labels in
each component 3-cube.


2. Apply step 1 recursively to the
(n-3)-cubes --...-000 and --...-100.


Figure 3 illustrates the construction 
procedure proposed. In Figure 4 we give an 
example of constructing the 32 node graph by 
applying algorithm Graph Construct. Some of the 
basic properties of the graphs constructed above 
are investigated next. 

2.1.1 Properties of Class I Graphs

Theorem 1: For any $n \geq 1$, algorithm
Graph Construct produces a $2^n$ node graph with
diameter n.
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Sopot) + [n/3} 


Theorem 2: For $n \geq 1$, the number of links
in $G_n$, the graph constructed by applying
algorithm Graph Construct, is


5 + gin/3], (n(mod3))7, 


where [x] is the integer part of x.


Corollary 1: The number of links in an N
node graph constructed by algorithm
Graph Construct is less than 1.5 N.


Theorem 3: In an N = $2^n$ node graph constructed
by algorithm Graph Construct, there are nodes
with different degrees and these degrees range
from 2 through n.


Theorem 4: For n > 1, the node
connectivity of a graph constructed by algorithm
Graph Construct is 2.


Theorem 5: For $n \geq 2$, the 1-deleted
diameter of an N = $2^n$ node graph constructed by
algorithm Graph Construct is at most n+2 if
$n \neq 4$, and for n = 4 it is 7.


2.2 Binomial Graphs 


The second class of graphs being proposed
is called Binomial Graphs. The construction of
Binomial Graphs is illustrated in Figure 5. The
number of nodes N in the graphs is $2^n$. Two
nodes, called the super nodes, have degree n and
the other nodes have degrees of 2 through
$\lceil (n+1)/2 \rceil$. The nodes are arranged into (n+1)
levels, with $\binom{n}{i}$ nodes in the ith level,
$0 \le i \le n$. Edges connect nodes in the ith level to nodes
in the (i+1)th and (i-1)th levels only,
$1 \le i \le n-1$. Furthermore every node in the
ith level is connected to exactly one node in
the (i-1)th level, $1 \le i \le [n/2]$. Every node
in the (n-1-j)th level is connected to exactly one
node in the (n-j)th level, $0 \le j \le [n/2] - 1$. For
n an odd integer a node in the [n/2]th level is
connected to exactly one node in the ([n/2]+1)th
level. As long as the connection rules given
above are satisfied, the details of which node 
is connected to which particular node(s) does 
not have an impact on the diameter and 
connectivity of the Binomial Graphs. However to 
keep the maximum degrees of nodes small, we also 
require that the edges be distributed such that 
the maximum difference between the degrees of 
nodes, at the same level, be 1. Examples of 
Binomial Graphs, for n=4 and 5, with the 
additional restriction on the distribution of 
edges, are given in Figure 6. Some of the 
properties of the Binomial Graphs are given 
next. 


Theorem 6: The diameter of the Binomial
Graphs with N = $2^n$ is n and the 1-deleted diameter
is 2n-1.


Theorem 7: The node connectivity of a 
Binomial Graph is 2. 


Theorem 8: The size of the Binomial Graph
with N = $2^n$ is $N - 2 + \binom{n}{[n/2]}$.


Corollary 2: The size of the Binomial
Graphs with N = $2^n$ nodes approaches N as $n \to \infty$.
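
The connection rules for Binomial Graphs can be exercised with a small Python sketch. It is only one possible reading of the rules: we assume the upper-half rule covers levels n-1 down to n-[n/2], and we pick the partner of each node round robin, which keeps the degrees within a level within 1 of each other. The names `binomial_graph`, `diameter` and `size` are ours.

```python
from collections import deque
from math import comb


def binomial_graph(n):
    """Adjacency dict of one Binomial Graph on 2**n nodes (assumed reading)."""
    levels, next_id = [], 0
    for i in range(n + 1):
        count = comb(n, i)                      # C(n, i) nodes in level i
        levels.append(list(range(next_id, next_id + count)))
        next_id += count
    adj = {v: set() for v in range(next_id)}

    def attach(src, dst):
        # Every node of src gets exactly one neighbour in dst, round robin.
        for k, u in enumerate(src):
            v = dst[k % len(dst)]
            adj[u].add(v)
            adj[v].add(u)

    half = n // 2
    for i in range(1, half + 1):                # lower half links downwards
        attach(levels[i], levels[i - 1])
    for j in range(half):                       # upper half links upwards
        attach(levels[n - 1 - j], levels[n - j])
    if n % 2 == 1:                              # odd n: join the two middle levels
        attach(levels[half], levels[half + 1])
    return adj


def size(adj):
    return sum(len(nbrs) for nbrs in adj.values()) // 2


def diameter(adj):
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best
```

Under these assumptions the constructed graphs have diameter n and $N - 2 + \binom{n}{[n/2]}$ links, in agreement with Theorems 6 and 8, and node 0 (a super node) has degree n.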


III. Remarks 

We have investigated several variations and 
extensions of the proposed graphs. For example 
it can be shown that graphs with maximum degree
5, diameter $\log_2 N$, connectivity 2 and size
approximately 1.3N can be constructed. We have
also constructed graphs with connectivity 
greater than 2. 


The regular graphs studied earlier [1-13]
other than the loop and the binary n-cube lack structure
for conveniently adding and deleting nodes.
The loop has the best properties for this problem,
since deleting or adding a single node
perturbs at most two links. A network based on
a binary n-cube can be made to have such
properties by assigning longer labels, such that
many labels are possibly unused. Then one can
connect new nodes and delete old nodes without
perturbing more than approximately $\log_2 N$
connections. The modularity properties of the
proposed graphs are good. For example many
nodes (3/4 of the nodes in Class I graphs and at least
$\binom{n}{[n/2]}$ nodes in Binomial Graphs) in the
proposed graphs can be deleted by perturbing at
most two links. Many nodes can also be added
while increasing the diameter by at most 1.
These results for Class I graphs are reported in
[17].

Table I: Comparison between Binary n-Cubes, Loops and the Proposed Graphs

Figure 5: Binomial Graph

Figure 6: Binomial Graphs with N = 16, 32


DENSE BUS CONNECTION NETWORKS 


Karl W. Doty 
Systems Control Technology, Inc. 


Palo Alto, California 


Abstract -- In designing interconnection
networks, two important criteria are minimization
of the distance between nodes and minimization
of the number of interconnections that
must be made among the nodes. When direct node- 
to-node connections are modeled, one problem 
which has been extensively examined is maximiz- 
ing the number of nodes subject to restrictions 
on the maximum internode distance and the 
maximum number of connections per node. This 
paper explores the same problem when more general 
bus-like connections among more than two nodes 
are allowed. New designs, including a general- 
ization of de Bruijn networks, are presented 
with more nodes than previously proposed designs. 


Introduction 


There are many important factors involved in 
the design of a computer interconnection network. 
The distances between processors should be mini- 
mal, in order to have short delay times. 

Physical limitations usually restrict the number 
of connections for each processor. Network 
performance should not be radically affected by 
processor or link failures. Cost considerations 
may reduce the possible number of connections. 
Appropriate tradeoffs need to be made among 
these factors. 

A graph theory model of a computer network, 
with nodes representing processors and edges 
representing links, has been extensively used to 
look at some of these factors. One problem which 
has been examined by several authors is the 
problem of finding the graph with the most nodes, 
given constraints on the graph's degree (the 
maximum number of nodes adjacent to one node) and 
diameter (the maximum number of edges which must 
be traversed between two nodes) [1-3]. In some 
computer systems, however, the model of having 
two nodes connected by an edge is not very 
realistic. Instead, several processors may be 
connected by a bus and may be thought of as being 
equally distant from each other. At the same 
time, each processor may be connected to several 
buses. The result is a bus connection network. 

The same factors are relevant in the con- 
struction of bus connection networks as in the 
construction of standard networks. We will focus 
on the property of the maximum distance between 
processors, or the diameter. Specifically, we 
will be looking for bus connection networks with 
as many nodes as possible as a function of the 
diameter, subject to constraints on the number of 
connections for each node and bus. Such networks 
should have smaller maximum communication delays 
than networks of comparable size. 


A Bus Connection Network Model 


We will use a model discussed by Mickunas [4]
for bus connection networks. In this model, each
node is incident on a certain number of buses, 
which we will call the node's degree. The 
maximum of the degrees of the nodes will be 
called the graph's nodal degree. Each bus has a 
certain number of nodes on it, which will be 
called the degree of the bus. The maximum of 
the degrees of the buses will be called the 
graph's bus degree. For notation, we will 
denote the nodal degree by d, the bus degree by 
b, the diameter by k, and the number of nodes 
by n. 

Two nodes will have a distance of one if 
they are incident on a common bus. In general, 
the distance between two nodes is the minimum 
number of buses which must be passed through to 
get between them. 

We will use a representation of a bus 
connection network as a bipartite graph. One 
type of node represents the original nodes, 
while the other type represents the buses. In 
the figures in this paper nodes are filled-in 
circles and buses are empty circles. An edge 
in a figure represents a node incident on a bus. 
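
To make the distance definition concrete, here is a minimal Python sketch that measures distances in a bus connection network by breadth-first search over the bipartite representation. The input format (a dict mapping each bus to the nodes on it) and the function names are our own assumptions.

```python
from collections import deque


def bus_distances(buses, source):
    """Map each reachable node to the number of buses crossed from source."""
    node_to_buses = {}
    for bus, members in buses.items():
        for node in members:
            node_to_buses.setdefault(node, []).append(bus)
    dist = {source: 0}
    used_buses = set()
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for bus in node_to_buses.get(u, []):
            if bus in used_buses:
                continue                 # this bus was already crossed earlier
            used_buses.add(bus)
            for v in buses[bus]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
    return dist


def bus_diameter(buses):
    """Largest bus-crossing distance over all pairs (network assumed connected)."""
    nodes = {node for members in buses.values() for node in members}
    return max(max(bus_distances(buses, s).values()) for s in nodes)


if __name__ == "__main__":
    example = {"A": [1, 2, 3], "B": [3, 4], "C": [4, 5, 6]}
    print(bus_distances(example, 1), bus_diameter(example))
```

In the small example, nodes 1, 2 and 3 share bus A and are at distance one from each other, while reaching node 5 from node 1 requires crossing three buses.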


Moore Graphs for Bus Connection Networks

For standard graphs, the bus degree b = 2.
In most of this paper we will assume b > 2,
since the other case is covered elsewhere [1-3].

When b = 2, the maximum possible number of
nodes a graph can have is

$$1 + d \sum_{j=0}^{k-1} (d-1)^j .$$

A graph with this number of nodes is called a
Moore graph.

An analogous concept can be defined for the
case of b > 2. From each node at most d buses
can be reached, from which at most d(b-1) nodes
can be reached. Thus the maximum number of nodes
in a diameter one graph is 1 + d(b-1). Similarly
in exactly two steps at most $d(d-1)(b-1)^2$ nodes
can be reached, and in exactly j steps at most
$d(d-1)^{j-1}(b-1)^j$ nodes can be reached. If we
define n(d,b,k) as the maximum number of nodes
in a graph with nodal degree d, bus degree b,
and diameter k, we have

$$n(d,b,k) \le 1 + d(b-1) \sum_{j=0}^{k-1} (d-1)^j (b-1)^j = 1 + d(b-1) \frac{(d-1)^k (b-1)^k - 1}{(d-1)(b-1) - 1} .$$

A graph with this number of nodes is called a
Moore geometry.
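
The bound is easy to evaluate directly. The following Python sketch sums the number of nodes reachable in exactly j steps, for j = 0 through k; the function name is ours.

```python
def moore_geometry_bound(d, b, k):
    """Upper bound on the number of nodes for nodal degree d, bus degree b, diameter k."""
    ratio = (d - 1) * (b - 1)
    # One source node plus at most d(b-1) * ratio**(j-1) nodes at distance exactly j.
    return 1 + d * (b - 1) * sum(ratio ** j for j in range(k))


if __name__ == "__main__":
    for d, b, k in [(3, 2, 2), (3, 3, 3), (4, 4, 2), (7, 7, 1)]:
        print(d, b, k, moore_geometry_bound(d, b, k))
```

For b = 2 this reduces to the classical Moore bound (10 for d = 3, k = 2), and for d = b = 3, k = 3 it gives 127.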

Since there are very few Moore graphs it is
natural to ask if there are any Moore geometries.
A Moore geometry with diameter 1 has 1 + d(b-1)
nodes. The simplest case here is where d = b
(the node degree equals the bus degree), so
there are $d^2 - d + 1$ nodes and the same number of
buses. Every pair of nodes is on exactly one
common bus. The existence of such a graph
depends on the existence of a mathematical object
called a finite projective plane, a subject which
has been extensively examined. It can be shown
that such a plane exists if d-1 is a power of a
prime number [5].

For d # b, diameter 1 Moore geometries can 
be based on balanced incomplete block designs, or 
BIBDs. A design arranges n objects (analogous 
to the nodes) into v blocks (analogous to the 
buses) so that every object is in d_ blocks, 
every block contains b objects, and every pair 
of objects occurs together in exactly one block. 
Several examples of BIBDs are given in [6]. 
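
As a small worked example, which is ours and not taken from the paper, the seven-point projective plane (the Fano plane) gives a diameter 1 Moore geometry with d = b = 3. The Python sketch below checks that every pair of nodes shares exactly one bus and that the node count equals 1 + d(b-1).

```python
from itertools import combinations

# The seven lines of the Fano plane, each used here as one bus.
FANO_BUSES = [
    {1, 2, 3}, {1, 4, 5}, {1, 6, 7},
    {2, 4, 6}, {2, 5, 7}, {3, 4, 7}, {3, 5, 6},
]


def is_diameter_one_moore_geometry(buses):
    """True if every node pair shares exactly one bus and n = 1 + d(b-1)."""
    nodes = set().union(*buses)
    d = max(sum(node in bus for bus in buses) for node in nodes)
    b = max(len(bus) for bus in buses)
    every_pair_once = all(
        sum(u in bus and v in bus for bus in buses) == 1
        for u, v in combinations(sorted(nodes), 2)
    )
    return every_pair_once and len(nodes) == 1 + d * (b - 1)


if __name__ == "__main__":
    print(is_diameter_one_moore_geometry(FANO_BUSES))
```

Here d - 1 = 2 is prime, so the existence of this plane is consistent with the condition quoted from [5].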

The question of the existence of Moore
geometries with diameters greater than one is
more complicated. None are known if b > 2. Bose
and Dowling [7] give necessary conditions for
existence when k = 2, although they could find no
graphs satisfying those conditions. Fuglister
[8] showed that there are no Moore geometries
with k = 3. Another result which is easily
proved is that there are no Moore geometries with
d = 2 and b > 2.


Previously Proposed Bus Connection Topologies 


Several construction methods for bus connection
networks have been proposed in the literature.
Perhaps the simplest bus connection network
is a hypercube. Such a network can be thought of
as a cube in d-dimensional space. Each row,
column, etc. of nodes corresponds to a bus. Each
node is on d buses, and each bus has b nodes.
The graph has a total of $b^d$ nodes.

A structure proposed by Wittie [9] is called
a dual bus hypercube. This is a hypercube with
many of the buses removed. Each node is connected
to two buses. One bus can be thought of as a
line in the same direction for all nodes. The
other bus is in the same direction for all nodes
in a single plane. A node corresponds to a
vector of the form $(x_1, x_2, \ldots, x_b, z)$ where the
x's and z take values from 1 to b. Thus there
are $b^{b+1}$ nodes. One bus that each node is on
connects all nodes with the same z value and the
same x values except one, the z-th. The diameter
of the graph is 2b. Other proposed bus connection
networks include the "snowflake" and "star" graphs
introduced by Finkel and Solomon [10].
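
The hypercube bus network described above can be generated mechanically. In this Python sketch, whose names and data layout are our own, a node is a d-tuple with entries in 0..b-1 and a bus is identified by an axis together with the values of the remaining coordinates.

```python
from itertools import product


def bus_hypercube(d, b):
    """Return (nodes, buses) for a d-dimensional hypercube with b nodes per bus."""
    nodes = list(product(range(b), repeat=d))
    buses = {}
    for node in nodes:
        for axis in range(d):
            # All nodes that differ only in coordinate `axis` share one bus.
            key = (axis, node[:axis] + node[axis + 1:])
            buses.setdefault(key, []).append(node)
    return nodes, buses


if __name__ == "__main__":
    nodes, buses = bus_hypercube(3, 4)
    print(len(nodes), len(buses))
```

With d = 3 and b = 4 this yields 64 nodes on 48 buses; two nodes are at distance equal to the number of coordinates in which they differ, so the diameter is d.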


New Bus Connection Network Topologies 


The first method we will consider is a 
generalization of the hinging graphs of Friedman 
[11]. In a standard hinging, nodes are arranged
in a hierarchical fashion, then the nodes at the
bottom level of several hierarchies are connected.
These networks are bipartite so we can take one
class of the original nodes to be the new "nodes"
and the other class the "buses." The graphs can
be generalized so that the two classes have 
different degrees. As an example, Figure 1 shows 
a hinging for k = 3, d = 3, b = 4, and n = 40. 

It can be proven that it is optimal (in the sense 
of having more nodes) for the object (node or bus) 
with the higher degree to be on the hinge. 
Formulas can be easily derived expressing n as a
function of d, b, and k, in which

$n = O([(d-1)(b-1)]^{k/2})$.


Figure 1
Example of a Bus Hinging Graph


When d = b, large networks can be found by
using chordal rings [1]. These are a generalization
of the chordal rings of Arden and Lee
[12]. A chordal ring is a graph which begins as
a ring on n nodes, then chords are added
connecting additional pairs of nodes. The
chordal lengths form a pattern, which repeats
around the ring. Again, we need a bipartite
structure. With an even number of nodes, the
ring part is naturally bipartite. In order to
ensure the graph remains bipartite, all chord
lengths must be odd. Table 1 lists the 
characteristics of a few of the chordal rings 
found by the author which can be used for bus 
connection networks. In most cases these are 
larger than the graphs which have been produced 
by other methods. An example with 24 nodes and 
buses, degree 3, and diameter 2 is shown in 


Figure 2. The pattern length for this graph is 8. 

Degree   Diameter   Number of Nodes   Moore Bound
   3         2             24               31
   3         3             75              127
   3         4            180              511
   3         5            455             2047
   4         2             12              121
   5         2            128              341
   6         2            242              781
   7         1             35               43

Table 1
Chordal Rings for Bus Connection Networks


A Generalization of de Bruijn Networks 


In this section we introduce a new construction
which generalizes the de Bruijn networks [2].
In a de Bruijn network, each node is associated
with a vector of k components, each being one of
the integers between 1 and d/2 (k is the graph
diameter and d is the degree, which is assumed
to be even). The node $(v_1, v_2, \ldots, v_k)$ is
connected to all nodes of the form
$(x, v_1, v_2, \ldots, v_{k-1})$ and all nodes of the form
$(v_2, v_3, \ldots, v_k, x)$, where x is an integer between 1 and
d/2. The graph has $(d/2)^k$ nodes.


Figure 2
Example of a Chordal Ring Bus Network


In order to generalize this to bus connection
networks, let us examine the node connection
rule more carefully. All nodes of the form
$(v_1, v_2, \ldots, v_{k-1}, x)$ are connected to all nodes
of the form $(y, v_1, \ldots, v_{k-1})$. With the bipartite
representation of graphs, the interface
between two of these sets of nodes with d = 6,
k = 3 looks like Figure 3. When higher degree
buses are allowed, the appearance of the interface
will change. The goal will be to choose
the interface so that the number of nodes per
set (the number of values per digit), which will
be denoted h, is as large as possible. This is
because the total number of nodes in the graph
is $h^k$. In the standard case, h = d/2. An
example of a more complex interface is shown
in Figure 4 for d = 4, b = 6, k = 3.


Figure 3 
Example Interface--Standard de Bruijn Graph 


Figure 4
Example Interface--Generalized de Bruijn Graph 


The same form of construction as in Figure
4 works whenever d and b are both even. We
will have h = (d/2)(b/2), and $(d/2)^2$ buses for
each interface. The first b/2 nodes on the
"bottom" level of an interface (the ones whose
vectors will be shifted forward) should be connected
to the first d/2 buses, the second b/2 nodes
to the second d/2 buses, etc. The first b/2
nodes on the "top" should be connected to those
buses which, if numbered left to right, are
congruent to 1 (mod d/2). The second b/2 should be
connected to those congruent to 2 (mod d/2), etc.
This makes every bottom node connected to every
top node via a bus. A similar construction is
used if either d or b is odd. Notice that the
generalized de Bruijn graphs are of size $(db/4)^k$.
This is larger than $O([(d-1)(b-1)]^{k/2})$ when
db > 16. The generalized de Bruijn graphs also
inherit all of the advantageous properties of the
standard de Bruijn graphs. For example, a
simple algorithm allows one to bypass a faulty
node by taking only four additional steps.
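
The interface rule for even d and b can be checked with a short Python sketch. It uses 0-based numbering, so "congruent to 1 (mod d/2)" becomes remainder 0, and it assumes $(d/2)^2$ buses per interface, which is what the grouping rule requires. The function name and data layout are ours.

```python
def interface(d, b):
    """Return (h, buses) for one interface of a generalized de Bruijn network."""
    assert d % 2 == 0 and b % 2 == 0, "this sketch covers even d and b only"
    half_d, half_b = d // 2, b // 2
    h = half_d * half_b
    buses = [{"bottom": [], "top": []} for _ in range(half_d * half_d)]
    for i in range(h):
        group = i // half_b
        # Bottom group g uses the g-th block of d/2 consecutive buses.
        for q in range(group * half_d, (group + 1) * half_d):
            buses[q]["bottom"].append(i)
        # Top group g uses every bus whose index has remainder g modulo d/2.
        for q in range(group, half_d * half_d, half_d):
            buses[q]["top"].append(i)
    return h, buses


def shared_buses(buses, bottom_node, top_node):
    """Number of buses containing both a given bottom node and a given top node."""
    return sum(bottom_node in bus["bottom"] and top_node in bus["top"] for bus in buses)


if __name__ == "__main__":
    h, buses = interface(4, 6)
    print(h, len(buses), h ** 3)
```

For d = 4, b = 6 this gives h = 6 nodes per set, four buses of six nodes each, and with k = 3 a network of $6^3 = 216$ nodes; every bottom node shares exactly one bus with every top node.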
Abstract: Recent developments in VLSI have made it
feasible to interconnect large numbers of single chip 
computers to form a multimicrocomputer network. 
Using task precedence graphs to represent the time 
varying behavior of parallel computations, we investi- 
gate the performance of interconnection topologies for 
multimicrocomputer networks. 


Introduction 


Current evolutionary trends in integrated circuit 
fabrication suggest that it will soon be cost effective to 
consider a new parallel processing paradigm based on 
large networks of interconnected single chip comput- 
ers. The single VLSI chip comprising each network 
node would contain a processor with a modicum of 
locally addressable memory, a communication con- 
troller capable of routing internode messages without 
delaying the processor, and a small number of connec- 
tions to other nodes. 


Among the suggested application areas for these 
multimicrocomputer networks are partial differential 
equations solvers and divide and conquer algorithms. 
The cooperating tasks of a parallel algorithm for solving 
one of these problems would execute asynchronously 
on different nodes and communicate via internode message
passing. The limited node fanout implied by the
VLSI implementation, as well as the absence of shared 
memory, make it crucial to select an interconnection 
network capable of efficiently supporting message pass- 
ing. In this paper we discuss computation paradigms 
for multimicrocomputer networks and techniques for 
assessing the performance of network interconnec- 
tions. 


Models of Computation 


In one view of parallel computation, all parallel 
tasks are known a priori and are statically mapped 
onto the network nodes before the computation begins. 
In this case, queueing theoretic models can be used to 
estimate the performance of a given multimicrocom- 
puter network executing a particular algorithm [4]. 


In the alternate view, a parallel computation is 
defined by a dynamically created task precedence 
graph. Tasks are created and destroyed as the compu- 
tation proceeds, and the mapping of tasks onto network 
nodes is done dynamically. 


Because most queueing theoretic models assume 
steady state behavior, they are not generally applicable 
to the study of time dependent parallel computations. In
particular, models of time dependent computation 
must account for time varying workloads, distribution 
of data to multiple tasks, and dynamic mapping of 
tasks onto network nodes using only partial knowledge 
of the global network state. Because we know of no 
analytic technique capable of accurately representing 
this behavior, we have adopted simulation as a means 
of study. 
Subsequent sections of this paper present five multimicrocomputer
interconnection networks, outline a
task precedence model of time dependent computa- 
tion, and discuss the results of a parametric simulation 
study of these interconnection networks when support- 
ing time dependent computations. 


Interconnection Networks 


Because of the computational expense of simula- 
tion, we limited our study to five interconnection net- 
works that earlier analysis suggested were worthy of 
further investigation: the 2-D spanning bus hypercube
[6], 2-D toroid [4], cube-connected cycles [3], 2-ary N- 
cube [1], and the complete connection. We have 
included the complete connection to determine the 
performance degradation attributable to incompletely 
connected networks. 


Task Precedence Graphs 


As stated earlier, our model of time varying com- 
putation is the task precedence graph. A precedence 
graph represents a computation as a series of depen- 
dencies. The results of all computations providing 
input to a task, its antecedents, must become available 
before the task is eligible for execution. 


In each precedence graph, three types of tasks 
can be distinguished: fork tasks, join tasks, and regu- 
lar tasks. A fork task has a single antecedent task and 
one or more consequent tasks; it represents the com- 
putation prior to initiation of parallel subtasks to solve 
a problem. A join task has one or more antecedent 
tasks and a single consequent task; it represents the 


combination of subproblem solutions to yield a solution 
to an entire problem. Finally, a regular task is any 
task that is not a fork or join task; it represents a sim- 
ple computation. If we interpret the juxtaposition AB 
of tasks to mean "A is an antecedent of B", a task precedence
graph can be formally defined by the following
grammar.


<precedence graph> ::= <regular task> | 


<fork task> <precedence graph>* <join task > 


As summarized in Table I, the characteristics of a 
precedence graph are determined by several parame- 
ters. Because the number of possible graph parame- 
terizations is so large, we have somewhat arbitrarily 
selected a set of values, also given in Table I, to be used 
as a reference point in our study. By systematically 
varying subsets of these parameters, we obtain 
different performance results. By comparing these 
results to those obtained using the reference parame- 
ters, we can estimate the effect of the variations. 
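
A recursive generator gives a feel for how such graphs arise. The Python sketch below is a hypothetical illustration, not the generator used in the study: a graph is either a single regular task or a fork task, one or more nested subgraphs, and a join task. The 0.5 stopping probability, the depth limit and all names are our own assumptions; the branching limits default to the reference values 1 and 4.

```python
import random


def random_precedence_graph(rng, depth, b_min=1, b_max=4):
    """Return 'regular' or ('fork', [subgraphs], 'join'), nested at most `depth` deep."""
    if depth == 0 or rng.random() > 0.5:
        return "regular"
    branches = rng.randint(b_min, b_max)     # number of consequents of the fork
    subgraphs = [random_precedence_graph(rng, depth - 1, b_min, b_max)
                 for _ in range(branches)]
    return ("fork", subgraphs, "join")


def count_tasks(graph):
    """Total number of fork, join and regular tasks."""
    if graph == "regular":
        return 1
    return 2 + sum(count_tasks(sub) for sub in graph[1])


def longest_path(graph):
    """Number of tasks on the longest antecedent chain through the graph."""
    if graph == "regular":
        return 1
    return 2 + max(longest_path(sub) for sub in graph[1])


if __name__ == "__main__":
    g = random_precedence_graph(random.Random(1), depth=4)
    print(count_tasks(g), longest_path(g))
```

A study along the lines described would repeat the generation until the task count and longest path fall within the chosen limits.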


Simulation Methodology


For comparative purposes, we generated twenty 
five task precedence graphs using the reference 
parameters shown in Table I. All service times were 
drawn from negative exponential distributions, the 
number of consequents of each fork task was uniformly 
distributed between $B_{min}$ and $B_{max}$, and all graphs were
constrained to have between Maxtasks / 2 and Maxtasks
tasks. Each node was assumed to possess complete
knowledge of the network state, and each task eligible
for execution was scheduled on the idle node nearest
its location. We will return to this assumption when discussing
distributed scheduling algorithms. Finally, to
model the fact that each network node is a single chip
with fixed bandwidth, we scaled the mean data communication
times by the number of link connections to
each node.


The average parallelism P attained when evaluating
a precedence graph on a network has been taken as
the measure of performance. This is

$$P = \frac{\sum_{i=1}^{\mathit{Numtasks}} S_i}{\text{parallel execution time}},$$

where $S_i$ is the service time of the i-th task (fork, join, or regular).


Simulation Experiments

Using the assumptions discussed above, we 


explored five different variations of precedence graph 
parameters and network characteristics and their 
effect on network performance: precedence graph 
structure, the event horizon of a distributed task 
scheduler, the maximum task branching factor, the 
mean computation time/communication time ratio, 
and the number of network nodes. The first two of 
these are discussed below; an analysis of the other vari- 
ations can be found in [5]. 


Precedence Graph Structure 


Figure I shows the graph parallelism when each of 
the twenty five graphs derived from the reference 
graph parameters was simulated on the five networks 
with 64 nodes. The precedence graphs were sorted in 
increasing order of parallelism on the complete con- 
nection. Table II shows the average parallelism over 
the set of graphs using each network. 


Two features of Figure I are of particular interest. 
The first is the way networks other than the complete 
connection exhibit the same performance trends from 
precedence graph to graph. This suggests that some- 
thing inherent to the graphs is affecting the time 
required for their evaluation. To determine what this 
might be, we examined two precedence graphs, 
numbers nine and eleven in the figure, that 
represented two extremes of behavior. Figure II shows 
the time varying parallelism when the two graphs were 
evaluated on a 2-D toroid with 64 nodes. The simula- 
tion of precedence graph nine exhibits a striking 
decrease in the number of parallel tasks near time 90. 
Because a similar simulation on the complete connec- 
tion exhibits no such decrease, we can only conclude 
that this variation is caused by the collapse of a paral- 
lel subgraph requiring the transmission of results 
across several communication links. During the delay 
caused by this transmission, tasks otherwise eligible for 
execution were forced to wait for these results. 


Figure I also points out the performance 
differential between the spanning bus hypercube and 
the networks using dedicated links. Although this
behavior may appear somewhat anomalous in light of
the apparently greater communication bearing capacity
of the dedicated link networks, this is not the case.
A detailed examination of the simulation results shows
that tasks generally execute on nodes near their point
of origin. In other words, the precedence graph evaluation
exhibits considerable communication locality. For
this communication pattern and the given ratio of computation
time to communication time for tasks, the
utilization of the communication links is low. Because 
of this, the buses of the spanning bus hypercube permit 
more rapid distribution of tasks to other nodes than 
the dedicated links of the other networks. For the 
same reason, distinct differences among the dedicated 
link networks are also not apparent. 


Event Horizon of a Distributed Scheduler 


Heretofore we have assumed that the task 
scheduler at each node always possesses complete 
knowledge of the global network state. In practice, 
only limited information is available, and it is often no 
longer completely accurate when it is received. 


To determine a scheduler’s operation in the face of 
partial knowledge, we postulated the existence of an 
event horizon for each network node. We assume the 
scheduler at each node has no knowledge of network
activity at any nodes beyond its event horizon and that
it must schedule all eligible tasks on nodes within its
event horizon. Using the reference precedence graph
parameters, Figure III shows the average graph parallelism
as a function of the distance to the event horizon
from a node. Similar results are obtained when the
ratio of computation times to communication times
varies from 1:1 to 100:1. Based on this limited evidence,
it appears that state knowledge of nodes within
a small distance from each source node is sufficient to
achieve reasonable results. This is encouraging
because it suggests that efficient distributed
schedulers can be constructed for multimicrocomputer
networks.
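
The event horizon rule can be sketched in a few lines of Python for a 2-D toroid. This is a hypothetical illustration rather than the simulator used in the study: node coordinates, the tie-breaking order, and the choice to return None when no idle node is visible are all our assumptions.

```python
def toroid_distance(a, b, side):
    """Number of links between nodes a and b on a side-by-side 2-D toroid."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, side - dx) + min(dy, side - dy)


def pick_node(origin, idle_nodes, horizon, side):
    """Nearest idle node within the event horizon of origin, or None."""
    visible = []
    for node in idle_nodes:
        dist = toroid_distance(origin, node, side)
        if horizon >= dist:              # nodes beyond the horizon are invisible
            visible.append((dist, node))
    return min(visible)[1] if visible else None


if __name__ == "__main__":
    idle = {(7, 0), (3, 3)}
    print(pick_node((0, 0), idle, horizon=1, side=8))
```

On an 8 by 8 toroid, node (7, 0) is one link away from (0, 0) because of the wraparound, so it is chosen even with a horizon of 1, while (3, 3) only becomes visible once the horizon reaches 6.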


Two final observations about distributed 
schedulers should be made. First, this dynamic 
scheduling strategy does not use the precedence graph 
structure to aid its decisions. It should be possible to 
design heuristics that take advantage of some graph 
specific information.

Second, the acquisition of state information from
nodes within an event horizon is decidedly more 
difficult for networks connected by buses than for 
those using dedicated communication links. This is pri- 
marily because so many more nodes are within a small 
number of bus crossings from a source node. Commun- 
icating state information to other nodes on the same 
bus could conceivably consume a significant portion of 
the available communication bandwidth. Additional
work is needed to determine the cost of acquiring state 
information. 


Summary


We have presented a model of time dependent 
parallel computation and studied the behavior of five 
multimicrocomputer interconnection networks sup- 
porting computations similar to those of the model. 
Among the issues considered were the relative perfor- 
mance of interconnection networks and the efficacy of 
distributed scheduling using incomplete information. 


For small, dynamically created tasks, the spanning 
bus hypercube appears to have better performance 
than the dedicated link networks because it can diffuse 
work more rapidly. This is not always true; if message 
routing does not exhibit enough locality (i.e., messages 
must cross many links to reach their destination), the 
smaller communication bearing capacity of the spanning
bus hypercube will be saturated, and the dedicated
link networks will be preferred. Clearly, the
selection of a network must be made with knowledge of
communication patterns and task sizes required by an
algorithm.


Finally, dynamic task scheduling using only local
information seems successful for the class of algorithms
represented by precedence graphs, suggesting
that efficient distributed schedulers can be designed.
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Table I Precedence graph parameters 


| | Reference 
Quantity Definition Value 


minimum number of conse- 1 


quents of a fork task 
maximum number of conse- 4 
quents of a fork task 
mean data communication 1 
time to initiate a fork or regu- 
lar task 
Cy mean data communication 1 
time to initiate a join task 
Maxpath maximum length path through 60 
the graph 
Numtasks | number of tasks in the graph 1024 
Sp mean fork task service time 10 
Op mean regular task service time 10 


Table II
Average graph parallelism for 64 node networks
using the reference precedence graph parameters

Network                      Average        Fraction of
                             parallelism    complete connection

Cube-connected Cycles        15.33          0.69
a-ary 4-cube                 15.42          0.70
2D Spanning Bus Hypercube    18.04          0.81
2D Toroid                    14.75          0.67


Figure I
Graph parallelism for 64 node networks
using the reference precedence graph parameters

Figure II
Time varying parallelism for two
precedence graphs on a 64 node toroid

Figure III
Average parallelism for 64 node networks
with varying amounts of scheduler information


Abstract


This paper presents a method for emulating multiprocessor
architectures on Cm*, a 50 processor multiprocessor at Carnegie
Mellon University. Combined with other instrumentation tools
already developed at CMU, the result is a flexible testbed for
multiprocessor architecture evaluation. An experiment to
demonstrate the usefulness of this testbed is presented along with
results for three architectures: ring, nearest neighbor, and fully
connected. These results are used to show how the testbed could be
used to aid in multiprocessor design. It is shown that for this
particular real-time application a three processor fully connected
structure provides more usable compute power than a six processor
ring.


Introduction 

Large computational problems such as weather forecasting, fusion
modeling, and aircraft simulation demand computing power far in
excess of that supplied by uniprocessors. Thus researchers have been
seeking alternative architectures to solve these pressing problems.
Many of these proposals involve the construction of multiprocessors,
systems in which a shared address space is provided for
interprocessor communication and synchronization. A very popular
form of multiprocessor is the Multiple Instruction stream, Multiple
Data stream (MIMD) computer [5].


All multiprocessors require an interconnection mechanism which
physically implements the shared address space. Numerous proposals
for such structures appear in the literature, covering a wide range of
performance, cost, and reliability [2, 4, 6, 15]. Unfortunately, few
working multiprocessors have actually been built, so the principal
sources of comparative information are modeling,
simulations, and educated guesses. Due to the complexity of parallel
program interaction, the complexity of modeling the hardware
systems, and the large size of the design space, many of the results to
date are inadequate.


To help alleviate this situation, the construction of a
multiprocessor testbed which could closely emulate a number of
architectures would be desirable. Several testbeds are currently
under development by other researchers, such as the network testbed
at Mitre [3] and the multiprocessor testbed at TRW [8]. With a
versatile emulator, accurate measurements of the performance of
these architectures under both synthetic and actual workloads could
be obtained. Fortunately, as will be shown, Cm* [13, 14] can function
as such a testbed and be used to answer some of the questions
plaguing multiprocessor research.


The message passing portion of the Medusa operating system [9] is
quite suitable for emulation of alternate architectures through
software routing schemes. The time cost of the routing software is
generally less than the time to send the message, as would be the case
in a real system. To see how useful such an emulation system might
be, an experiment to model the communications demands of an
actual multiprocessor was implemented. Four versions were built: one
using full interconnection directly implemented with Medusa
communication primitives, one using emulated full interconnection,
one using emulated nearest neighbor communication, and one using an
emulated ring. The ability to remove unwanted Cm*
characteristics and quantify emulation overhead was demonstrated.
To distinguish the different experiments, the direct implementation
will be referred to as the basic experiment, and the other three as the
direct connect, nearest neighbor, and ring experiments.


Description of the Cm* Testbed Environment 


The testbed described in this paper is being developed on Cm*, a
50 processor multiprocessor operating at Carnegie-Mellon University
[13, 14]. The processors are Digital Equipment Corporation LSI-11s
arranged in a two level hierarchy of Computer Modules and Clusters,
as shown in Figure 1. The custom built communications mapping
processors (Kmaps) handle interprocessor memory references and
provide low level operating system support. For simple memory
references between Cms within the same cluster, access times are
about a factor of three longer than those of intra-Cm references,
while references between clusters take nine times as long. For
messages the distinction between intercluster and intracluster transit
times disappears, as will be demonstrated shortly.


All of the experiments were performed using the Medusa
operating system. In Medusa, an experiment consists of a group of
cooperating activities (processes) which are termed a task force. Each
activity is assigned to a Computer Module (Cm), in which it alone
executes, and given its own copy of code. Communication between
activities can be by shared memory, with the shared memory resident
on any Cm in the system, or by message passing, using a block
transfer mechanism implemented in Kmap microcode. Messages are
passed through Pipes which queue up waiting messages in FIFO
order, up to the capacity of the Pipe. In this respect they are much
like Unix Pipes, except that they maintain the identity of each
message and will only deliver them as separate units. Routines
running on the LSI-11s invoke Kmap operations to send and receive
messages, with the choice of suspending execution when an
operation cannot be completed or quitting with an error indication.
The experiments make use of these facilities to emulate a
multicomputer system.
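
The Pipe behavior described above can be illustrated with a minimal Python sketch. It is not Medusa code: the class name, the exception, and the method names are hypothetical, and only the non-suspending variant (quitting with an error indication when the Pipe is full) is modeled.

```python
from collections import deque


class PipeFull(Exception):
    """Hypothetical error indication for a send on a full Pipe."""


class Pipe:
    """Bounded FIFO queue that delivers whole messages, never fragments."""

    def __init__(self, capacity):
        self.capacity = capacity      # maximum number of queued messages
        self._queue = deque()

    def send(self, message):
        # Non-suspending variant: report an error instead of waiting.
        if len(self._queue) >= self.capacity:
            raise PipeFull("pipe at capacity")
        # Store the message as one immutable unit to keep its identity.
        self._queue.append(tuple(message))

    def receive(self):
        # Return the oldest waiting message, or None if nothing is queued.
        return self._queue.popleft() if self._queue else None


pipe = Pipe(capacity=2)
pipe.send([1, 2, 3])
pipe.send([4])
first = pipe.receive()
```

Unlike a byte-stream pipe, the receiver here gets `(1, 2, 3)` and `(4,)` as two separate units, in the order they were sent.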
Figure 1: Cm* Multiprocessor Structure


In emulating different architectures it is desirable to eliminate any
biases caused by the underlying system. Cm* has a hierarchical
communication structure which can cause variations in system
performance depending on the relative locations of the
communicating processes. It turns out, though, that the Medusa
message passing system exhibits approximately constant delays,
regardless of subtask location, provided the Pipes are properly
located, as discussed below.


In the Medusa message system, greatest message throughput is
obtained when the Pipe is on a different cluster than the Sender and
Receiver subtasks. This surprising result was discovered while
measuring message system throughput for the emulations. It appears
to be an effect of Kmap contention caused by the heavy processing
load placed on the Kmaps by the communication mechanism.
Placing the Pipe on a different cluster distributes this load over
several Kmaps, resulting in the observed speedup. If several messages
are being transmitted by the system simultaneously there may be less
impact from distributing the workload, but the effect has been
observed even under those conditions.


Figure 2 shows the effect of placing the Pipe on the same cluster.
For all three curves in the figure the Pipe was on cluster one. The
cluster one transfers were considerably slower than those on clusters
two and three. If the cluster one transfer measurements are redone
with the Pipe on another cluster, the transfer rate is identical to those
of the other two clusters. There is a small decrease in transfer rates
when the Sending and Receiving processes are both on the same
cluster, as shown in Figure 3. The decrease is not significant,
however. If Pipes are properly placed, a nearly flat communication
structure results, providing a base for accurate emulation.


Multiprocessor Emulations Possible on Cm* 


The space of possible multiprocessor designs is large. Dimensions
on which multiprocessors may vary include the number of
independent instruction streams, speed of processors, bandwidth of
interconnection links, and degree of connectivity. The specific
attributes of Cm* along these dimensions determine the space
of multiprocessors which the Cm* testbed can emulate.


A major area of research in multiprocessor architectures involves
the design and evaluation of the interconnection mechanisms used in
them. A few of the possibilities include multistage networks such as
the Augmented Data Manipulator [1] and Omega [7], point-to-point
networks such as Cube Connected Cycles [10] and nearest neighbor
meshes, and shared bus networks, such as the n... sh scheme originally
proposed for Cm* [12]. Interprocessor communication may be
through direct memory references or through explicit messages. The
multistage networks tend to favor direct memory references or short
messages, while the more sparsely connected point-to-point networks
favor long messages with store and forward operation at the nodes.
Since the testbed described in this paper uses software routing and
explicit message passing, it is better suited to emulating the point-to-
point networks and certain types of shared bus networks. However,
special cases of multistage networks may also be feasible to emulate.
Some examples of the multiprocessor interconnection networks
suitable for emulation on Cm* are shown in Figure 4. Research is
continuing on developing a testbed which can accurately emulate the
multistage networks as well.


Since Cm* is a MIMD machine with asynchronously operating
computer modules, it would be inappropriate to attempt to emulate
multiprocessors without those features. In normal operation code
and local data are kept in the memory of the computer module using
them and are accessed directly rather than through the
interconnection network. While it is possible to force all accesses
through the network, it is probably preferable to concentrate
experimentation on architectures which feature the same computer
module structure. The results reported in this paper all assume local
access to code and non-shared data.

Figure 3: Message transfer, Intercluster vs Intracluster

Figure 4: Typical Interconnection Networks suitable for
Cm* Testbed Emulation


Methods used to Emulate 
Multiprocessor Architectures 


The emulation package was implemented by adding a subroutine
package to each of the subtasks to perform message routing and
delivery. In this scheme each subtask is only allowed to communicate
with logically adjacent subtasks as determined by routing tables.
Messages for nonadjacent subtasks have to be forwarded by
intermediate subtasks.


Subtasks communicate through a set of virtual buffers
corresponding to the physical buffers used in the basic experiment.
Messages sent over these buffers are routed by information
contained in the first word of the message. This header word contains
source, destination, and logical buffer indices. Routing tables
computed partly at compile time and partly during initialization
determine the exact path taken by each message.
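
As an illustration of a header word carrying source, destination, and logical buffer indices, the following Python sketch packs the three fields into one 16-bit word (the LSI-11 word size). The field widths and function names are assumptions made for this example; the paper does not give the actual layout.

```python
# Assumed field widths (hypothetical): 4 + 4 + 8 = 16 bits.
SRC_BITS, DST_BITS, BUF_BITS = 4, 4, 8


def pack_header(src, dst, buf):
    """Pack source, destination, and buffer indices into one word."""
    for value, bits in ((src, SRC_BITS), (dst, DST_BITS), (buf, BUF_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field value does not fit its assumed width")
    return (src << (DST_BITS + BUF_BITS)) | (dst << BUF_BITS) | buf


def unpack_header(word):
    """Recover (source, destination, buffer) from a packed header word."""
    buf = word & ((1 << BUF_BITS) - 1)
    dst = (word >> BUF_BITS) & ((1 << DST_BITS) - 1)
    src = (word >> (DST_BITS + BUF_BITS)) & ((1 << SRC_BITS) - 1)
    return src, dst, buf


header = pack_header(2, 5, 17)
```

A forwarding subtask would only need to unpack the destination field and look it up in its routing table to choose the next buffer.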


The actual operation of the message system starts when a subtask
calls the SNIOMUSS subroutine in the routing package to send a
simulated message. This routine computes the routing word for the
message and calls FRWRDMSS to initiate its transport through the
network. FRWRDMSS determines the appropriate physical output
buffer and invokes the Kmap to send the message through it. Each
subtask repeatedly calls the CHIKBUFS routine to determine if any
messages have arrived and to deliver or forward them as appropriate.
If the message is for the local subtask, INCRMESS is called to
increment a received message counter and check for overflow. A flow
control system is provided to prevent deadlock; it limits the
number of messages in transit through the emulated interconnection
networks to a safe number. Buffer overflow is determined by
examining the message counts for each virtual buffer maintained by
INCRMESS. Buffer overflow in this scheme is analogous to buffer
overflow in the basic experiment.


Using the emulation package described above, a large variety of
interconnection structures can be modeled by simply changing
parameters in the routing tables. Since real messages are passed
between processors, actual working programs can be used to test the
structures. This allows the data dependencies often associated with
real multiprocessor algorithms to be fully reflected in the observed
interconnection structure behavior.
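
To show how a change of routing tables alone can switch the emulated structure, the sketch below builds next-hop tables for three small topologies. It is a hypothetical reconstruction in Python, not the emulation package itself: the function names are invented, shortest-path routing by breadth-first search is an assumption, and the nearest neighbor case is modeled as a line, as in the three subtask experiments.

```python
from collections import deque


def fully_connected(n):
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def line(n):
    # Nearest neighbor arrangement of n subtasks in a row.
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def ring(n):
    return {i: sorted({(i - 1) % n, (i + 1) % n} - {i}) for i in range(n)}


def next_hop_table(adjacency):
    """table[src][dst] is the neighbor of src on a shortest path to dst."""
    table = {}
    for src in adjacency:
        first_hop = {src: None}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nbr in adjacency[node]:
                if nbr not in first_hop:
                    # Direct neighbors route to themselves; others inherit.
                    first_hop[nbr] = nbr if node == src else first_hop[node]
                    queue.append(nbr)
        table[src] = {d: h for d, h in first_hop.items() if d != src}
    return table


def hops(table, src, dst):
    """Count the links a message crosses when forwarded hop by hop."""
    count = 0
    while src != dst:
        src = table[src][dst]
        count += 1
    return count


line_table = next_hop_table(line(3))
ring_table = next_hop_table(ring(6))
```

With three nodes the ring and the fully connected structures produce the same adjacency, which matches the observation later in the paper that no separate three processor ring data was needed.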
Description of Modeled System


The experiments consist of simulating the high level behavior of a
real-time multiple computer system on different emulated
interconnection structures. The actual system consists of three
clusters of minicomputers, with four processors in each cluster.
Within each cluster communication is through shared memory,
making each cluster a small multiprocessor. Between clusters,
point-to-point communication paths are used, resulting in a
multicomputer structure.


The multicomputer system is driven by inputs from external
sensors. Its outputs consist of status displays and actuators. The three
computer clusters perform distinct functions in the overall system,
from which their mnemonics are derived. The Actuator Control
System (ACS) has primary control over the mechanical devices used
in this system. A second computer cluster, termed the SPU, controls
a signal processing unit. The final cluster handles overall control and
information display, earning it the title of Control and Display
(C&D).


This is a real-time system which must respond immediately to any
significant event. Thus there are minimum throughput requirements
as well as constraints on the maximum latency of certain operations.
The object of the experiment is to emulate the system behavior, with
Cm* used as a testbed comparing the performance of different
architectures. The performance is measured by varying the synthetic
workload (which simulates the high level system behavior) and
message length while determining the maximum sustainable
communications rate.


These experiments were based on a high level description of the
system. The communications requirements at the multicomputer
level are shown in Figure 5. The requirements included information
on the average message rate between clusters and external devices,
between different clusters, and the allowed minimum response time
for certain high level activities. The response time limits proved to be
difficult to measure with the present level of Cm* instrumentation.
However, the Message Event generator provided on Cm* does
facilitate the generation of messages at fixed time intervals, which can
be used to stress the communications system. This generator is used
to simulate external inputs to the control system which would
normally determine total system workload. Failure to meet real-time
requirements was indicated by message buffer overflows somewhere
in the system.


DATA FLOW      BITS IN   WORDS IN    MESSAGES     DATA BITS TRANSFERRED
DIRECTION      A WORD    A MESSAGE   PER SECOND   PER SECOND

C&D -- SPU                                        131K
  C&D out      32        256          2
  C&D in       32        256         14
ACS -- SPU                                        424K
  ACS out      32        138         49
  ACS in       32        132         49
C&D -- ACS                                        262K
  ACS out      32        256         16
  ACS in       32        256         16


Figure 5: Communication Rates in Multicomputer System
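
The per-path totals in Figure 5 can be checked by multiplying bits per word, words per message, and messages per second, then summing both directions of a path. The Python sketch below does this with the values as printed; the variable and function names are illustrative only.

```python
BITS_PER_WORD = 32

# (path, direction, words per message, messages per second), from Figure 5.
FLOWS = [
    ("C&D -- SPU", "C&D out", 256, 2),
    ("C&D -- SPU", "C&D in", 256, 14),
    ("ACS -- SPU", "ACS out", 138, 49),
    ("ACS -- SPU", "ACS in", 132, 49),
    ("C&D -- ACS", "ACS out", 256, 16),
    ("C&D -- ACS", "ACS in", 256, 16),
]


def bits_per_second(flows):
    """Sum the data rate of both directions for each path."""
    totals = {}
    for path, _direction, words, rate in flows:
        totals[path] = totals.get(path, 0) + BITS_PER_WORD * words * rate
    return totals


totals = bits_per_second(FLOWS)
```

The C&D paths come out at 131,072 and 262,144 bits per second, matching the 131K and 262K entries. The ACS -- SPU path comes out at 423,360, slightly under the printed 424K.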


Description of Experimental Methodology


The actual experiment consists of a task force composed of one or
more processors representing each multiprocessor cluster, plus
several support activities. The experiment utilizes resources from the
Synthetic Workload Generator system [11] and was intended to be
implemented with its "B" language. In this environment, Pipes are
termed buffers and activities are termed subtasks. The support subtasks
consist of the Pegasus user interface, the message event generator, and
a monitoring routine. The user interface provides communication
with the user's terminal, controls operation of the message event
generator, and allows control of user specified variables in the user's
subtasks. The message event generator monitors a real-time clock
and sends short (one word) messages to specific subtasks at specified
intervals. The monitor subtask has an array of integers where other
subtasks may record the occurrence of errors. It repeatedly scans
these integers for evidence of changes and reports them to the user.
In the present experiment these integers record the occurrence of
message buffer overflows, as detected during attempted message
sends. Using shared memory instead of short error messages is a
much less expensive way of communicating this error information.


The modeled version of the real-time system is shown in Figure 6.
The simulated workload is driven by the message event generator,
which periodically sends event tokens to the SPU to initiate a unit of
system activity by triggering an SPU to C&D message. This message
in turn generates other messages in such a manner that the average
communication rate on the data paths in the system is similar to that
experienced by the real system. This is done by generating new
messages to send to other nodes under the control of a random
number generator. The C&D subtask will generate two new messages
for each message received from SPU, with 7% going back to SPU,
57% to ACS, and 36% going nowhere. C&D messages which arrive at
the ACS subtask spawn four more messages, 25% of which return to
C&D and 75% of which continue on to SPU. The SPU messages
generate messages back to ACS, where they are terminated. The
resulting message rates are shown in parentheses in the figure, and
correspond well to actual system rates.
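
The fan-out percentages above translate into expected message counts per event token. The following Python sketch is a rough reading of the text, under the assumption that only the initial SPU to C&D message triggers the C&D fan-out and only C&D messages arriving at ACS trigger the ACS fan-out; the further SPU to ACS traffic is not modeled. All names are illustrative.

```python
# Fan-out and routing percentages as stated in the text.
CD_FANOUT = 2
CD_ROUTING = {"SPU": 0.07, "ACS": 0.57, "discard": 0.36}
ACS_FANOUT = 4
ACS_ROUTING = {"C&D": 0.25, "SPU": 0.75}


def expected_messages_per_event():
    """Expected messages on each path for one SPU to C&D trigger."""
    cd_to_spu = CD_FANOUT * CD_ROUTING["SPU"]
    cd_to_acs = CD_FANOUT * CD_ROUTING["ACS"]
    acs_out = cd_to_acs * ACS_FANOUT   # messages spawned at ACS
    return {
        ("SPU", "C&D"): 1.0,
        ("C&D", "SPU"): cd_to_spu,
        ("C&D", "ACS"): cd_to_acs,
        ("ACS", "C&D"): acs_out * ACS_ROUTING["C&D"],
        ("ACS", "SPU"): acs_out * ACS_ROUTING["SPU"],
    }


expected = expected_messages_per_event()
```

Multiplying these per-event counts by the event rate of the message event generator gives average path rates under the stated assumptions.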


In the real system the stated message traffic is only an average and
varies with actual workload and computation patterns. This effect is
simulated by using a uniform random number generator to
determine the routing of messages whenever there is more than one
possible destination. The generator provides a degree of random
clustering of messages such as a real system might see, allowing the
effects of buffer size and message length to be observed.
Distributions other than uniform could be obtained if a particular
experiment required them.


When a message is received by a subtask it performs some
simulated work corresponding to that which the actual system would
have to do. This simply consists of executing a null loop a number of
times as requested by the experimenter through the Pegasus user
interface. The number of loops per received message executed by
each subtask is scaled by the average number of messages received
per second so that each subtask executes the same average amount of
work. For example, when actual system message rates are used, the
SPU subtask receives an average of 50 messages a second and
executes 256 loops for each message, while the ACS subtask receives
64 messages per second and executes 200 loops per message. Thus
both subtasks execute 12800 loops per second. The random number
schemes could also be used here to provide some variability if
desired. The experiments were conducted over a range of such
workloads, with all subtasks subject to approximately the same load.

Figure 6: Representation of Data Flow Used in the Experiment
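
The scaling rule in the example above is a single division: loops per message equals the common loops-per-second target divided by the subtask's message rate. A short Python check, with an illustrative function name:

```python
def loops_per_message(target_loops_per_second, messages_per_second):
    """Null-loop count per received message for a given message rate."""
    return target_loops_per_second / messages_per_second


TARGET = 12800                              # loops per second, from the text
spu_loops = loops_per_message(TARGET, 50)   # SPU: 50 messages per second
acs_loops = loops_per_message(TARGET, 64)   # ACS: 64 messages per second
```

This reproduces the 256 and 200 loops per message quoted for the SPU and ACS subtasks.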


The maximum sustainable message rate for each combination of
interconnection structure and synthetic work load was determined by
repeated runs at increasing intervals between message events until a
sustained period of operation in which no buffer overflows occurred
was observed. Processing power was varied by varying the number
of processors per subtask. Workload per message and message length
were also varied for each structure in order to aid in characterizing its
performance. With each processor executing the same amount of
work, any overflows recorded were due to message system saturation.
The communication abilities of the direct connect, nearest neighbor,
and ring networks were then compared.
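
The search for the saturation point can be sketched as a simple loop over increasing message event periods. In the Python sketch below, `run_overflows` stands for one sustained run of the experiment and is a hypothetical callback; `toy_run` is a stand-in model invented for this example, not a measurement.

```python
def find_saturation_period(run_overflows, start_ms, step_ms, max_ms):
    """Return the first event period with no buffer overflow, else None.

    run_overflows(period_ms) must return True if any message buffer
    overflowed during a sustained run at that event period.
    """
    period = start_ms
    while period <= max_ms:
        if not run_overflows(period):
            return period
        period += step_ms
    return None


def toy_run(period_ms, service_ms=12):
    # Stand-in model: overflow whenever events arrive faster than an
    # assumed fixed service time per event.
    return period_ms < service_ms


saturation = find_saturation_period(toy_run, start_ms=2, step_ms=2, max_ms=40)
```

The reciprocal of the returned period is the maximum sustainable message event rate for that configuration.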


Results 


The first experiment consisted of the basic point-to-point, fully
connected network similar to that used in the actual system. One
subtask was used to represent each computer cluster of the system.
Three different message lengths were used: one-half, one-fourth, and
one-eighth of those specified in the actual system. Unfortunately the
full message length proved to be too much for Cm* to handle, and so
was deleted from the experiments. The system was deemed to have
saturated when any message buffer overflowed.² Message system
saturation periods for six different synthetic work loads were found
for each message length. The results, averaged over three successive
runs, are shown in Figure 7. Note that the three curves are
essentially linear. However, small values of simulated work show a
disproportionately longer period between messages, due to the
increased significance of message activity. As message length
decreases (decreasing scale factor) the plots become linear even at
low synthetic work load, again indicating the reduced effect of
messages. The nonzero saturation message period observed at zero
synthetic work load is due to housekeeping tasks and message
send/receive invocations.

² Usually the SPU to WCS buffer overflows first, probably because it is the busiest
buffer and the WCS subtask receives the most total messages.

Figure 7: Saturation Curves for the Basic Experiment


As messages get longer there is a disproportionate increase in the
small workload message event period. This is most likely due to the
operation of the message system. When a process sends or receives a
message it is suspended for the entire duration of that event. Since in
Medusa messages take about twenty microseconds per word to
transport, this value becomes significant at longer message lengths.
In these experiments the message rate tends to decrease as the
workload increases, since both workload and message passing
contribute to message system saturation. Thus the contribution from
the message system is large at low simulated workloads and small at
high simulated workloads, resulting in the nonlinear curves evident
in the figures.
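
To give a sense of scale for the suspension time, the per-word figure quoted above can be applied to a 256 word message from Figure 5 at the three scale factors used in the experiments. This Python sketch assumes transport time is simply proportional to message length, which the text implies but does not state exactly.

```python
US_PER_WORD = 20            # approximate Medusa transport time per word
FULL_MESSAGE_WORDS = 256    # full length of a C&D message in Figure 5


def transfer_ms(words):
    """Approximate time a process is suspended for one message, in ms."""
    return words * US_PER_WORD / 1000.0


# Scale factors one-half, one-fourth, and one-eighth.
times_ms = {d: transfer_ms(FULL_MESSAGE_WORDS // d) for d in (2, 4, 8)}
```

Under this assumption a half length message suspends the process for about 2.56 ms per send or receive, against about 0.64 ms at one-eighth length.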


Before collecting data on the nearest neighbor emulation a
calibration experiment was attempted to check the accuracy of the
emulation scheme. A fully connected network was implemented with
the emulation package developed for the nearest neighbor
emulation. The logical interconnection structure is shown in Figure
8. The circles labeled S0 to S5 represent the message switching
subroutines used to route and forward messages. The other circles
represent the activities performing the actual processing. Each pair of
switching node and adjacent activity resides on a single Cm. The
experiments using one processor per subtask are indicated by the
solid components. Additional experiments using two processors per
subtask include the dotted components as well.


The resulting message saturation data was compared with the
initial, nonemulated experiment. The curves exhibit a small offset
from the basic experiment which is due to the extra overhead of the
send and receive subroutines, and the periodic polling for incoming
messages. As can be seen from the curves presented in Figure 9, this
overhead is a constant 4-6 milliseconds over a wide range of
workload and message length. Thus its effect can be easily factored
out, as will be demonstrated in the conclusions section.


These experiments were then repeated for a Nearest Neighbor
emulation of the multicomputer system. Since there are only three
subtasks used in these initial trials, the interconnection structure
actually consists of a line rather than a mesh. Figure 10 shows the
actual interconnection of the subtasks, where the indicated
components correspond to those described above for the fully
connected emulation. The subtasks were arranged in the optimal
order as determined by their communication behavior. A comparison
was made with a less optimal order to see how large the effect would
be. As seen in Figure 11, the message saturation period increased
dramatically. In future studies with more subtasks, different
arrangements will have to be tried to ensure optimality.

Figure 9: Saturation Curves for the
Fully Connected Network Emulation


The measured workload curves for the nearest neighbor network,
as shown in Figure 12, again show a decrease in linearity with
increasing message length. The anomalous behavior of the long
message length curve at low workloads (i.e. the shorter than expected
message period) is currently under investigation and appears to be an
isolated, though repeatable, case. Even with optimal configurations,
the saturation points occur at a significantly lower message rate (i.e.
larger message period) than with the basic experiment. Some of this
is due to the increased overhead of the message routing subroutines,
as evidenced in the fully connected emulation results, and some to
the additional burdens of message forwarding, as would be expected
with the poorer connectivity.


After obtaining results for the three processor systems, a similar set
of experiments was tried with six processors. In addition a ring
network was added. Since the three processor ring configuration is
identical to the fully connected network, there was no need to collect
separate three processor data. A fully connected network of six
processors was included to provide a baseline for the other
architectures and to determine the amount of overhead to subtract
from the six processor configurations.

Figure 11: Comparison of Different Subtask Placement


With the six processor architectures each multiple processor
subtask is implemented with two processors. The assumption here is
that processors make intersubtask references directly to the intended
processor, rather than through an arbitrary link processor in its
subtask group. It is also assumed that intrasubtask references go
through the shared memory of its computer group and do not enter
into these experiments. With increasing numbers of processors, the
relative merits of the proposed interconnect structures become
apparent.

Figure 12: Saturation Curves for the Nearest Neighbor Network


Figure 13 presents both the three and six processor fully
connected curves. Note that at large workloads the throughput of the
six processor cases is approximately double that of the three
processor cases, as would be expected from the dominance of
processing over message passing at large workloads. If the overhead
correction factors obtained earlier from comparisons of the basic and
fully connected three processor architectures are applied, it is found
that a correction factor of one-half the three processor correction
factor yields almost exactly a factor of two difference for large
workloads. This halving of the correction factor is consistent with the
notion that the overhead observed is due to the extra computation
required by the emulation routines. With six processors instead of
three, the emulation overhead per processor is halved for any given
message rate. The correction factors derived for three and six
processor cases will be applied to the raw data when making detailed
comparisons of the interconnection structures later in the paper.


The raw data for the six processor nearest neighbor and ring
networks are presented in Figure 14. The reduced connectivity of the
ring network is reflected in its poorer performance as compared to
the nearest neighbor at long message lengths, though they exhibit
similar behavior at shorter message lengths.


Conclusions 

This paper has demonstrated a possible method for using Cm* as a
multiprocessor testbed. Of particular interest to designers of
multiprocessors is the amount of useful work they can obtain for a
given application on various architectures. To indicate how this
might be done, an attempt was made to compare the work available
for application tasks on the various architectures. The emulation
overhead was first subtracted out as described in the results section,
then the percentage of time the processors were used for actual
computation was computed. The computation times were calculated
from the number of synthetic work loops executed per external
message and the measured external message rates. Finally,
representative samples were plotted.
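
The calculation described in this paragraph can be sketched as follows. This Python fragment is an interpretation, not the authors' procedure: the function name is invented, the loop execution time of 10 microseconds is a placeholder (the paper does not state it), and the 5 ms overhead is simply a value inside the 4-6 ms range reported in the results.

```python
def usable_fraction(loops_per_event, loop_time_s,
                    saturation_period_s, overhead_s):
    """Fraction of each event period spent on synthetic computation.

    The emulation overhead is subtracted from the measured saturation
    period before the fraction is taken.
    """
    corrected_period = saturation_period_s - overhead_s
    if corrected_period <= 0:
        raise ValueError("overhead exceeds the measured period")
    return (loops_per_event * loop_time_s) / corrected_period


# Placeholder numbers for illustration only.
fraction = usable_fraction(
    loops_per_event=1000,        # synthetic work loops per external message
    loop_time_s=10e-6,           # assumed time per null loop
    saturation_period_s=30e-3,   # assumed measured period at saturation
    overhead_s=5e-3,             # within the reported 4-6 ms range
)
```

With these placeholder values the processors spend 40% of the corrected period on computation; the remainder goes to message handling.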
Figure 13: Saturation Curves for Both Fully Connected Networks


After calculating the usable processing time as described above,
the various network configurations can be compared. Figure
15 compares the long message length (scale of one-half) results for all
network configurations after adjustment for overhead. At first glance
it appears that the usable processing power of each processor when
two processors per subtask are employed is substantially higher than
with only one processor. This is due to the fact that each processor is
only involved with half of the messages sent or received by the
subtask. Thus if the amount of processing done by each subtask is
directly determined by the incoming message rate, as it is in the
system modeled in these experiments, the message rate per processor
needs to be considered. Figure 16 compares the three processor
configuration of the fully connected network with the six processor
configuration, after accounting for the lessened message rate per
processor. Note that the six processor architecture is actually using its
processors less efficiently than the three processor case.

Figure 14: Six Processor Nearest Neighbor and Ring Results

Figure 15: Comparison of Adjusted Results for Scale = 1/2


Even though the experiments involved only a small number of
processors, some interesting results can be seen. Again examining
Figure 15, it is seen that for this application a six processor ring
utilizes its processors no more effectively than a three processor
nearest neighbor. Of course more processing power is available since
two processors are assigned to each subtask. Notice, though, that
above 20 messages per second a three processor fully connected system
has more than twice as much available power per processor as the
six processor ring. Thus it would definitely be the preferred system at
the high message rates. Below 20 messages a second the six processor
ring has more total processing power available and might be preferred
if large amounts of processing were required. It would be cheaper
than either the six processor nearest neighbor or fully connected
systems, provided it supplied sufficient processing power.


Of particular interest is the relative performance of the three 
multiprocessor architectures when six processors are employed, since 
the relative connectivity of the interconnection structures will have a 
more significant effect on the results. Figure 17 compares the curves 
produced by all three structures when one-half and one-fourth length 
messages are used. The effects of emulation overhead have been 
factored out in these graphs, so that direct performance comparison 
is possible. The ring and nearest neighbor networks provide similar 
amounts of processing time to their application tasks, though at long 
message lengths the nearest neighbor does perform slightly better. In 
all cases the fully connected architecture performs markedly better, 
performing almost as well at long message lengths as the other two 
architectures at short message lengths. At an external message rate 
of 20 messages per second the fully connected network provides 
almost 2.5 times the processing potential of the nearest neighbor 
network. Although the fully connected network requires fifteen 
interconnection links to the nearest neighbor's seven, it would 
probably be the better choice. 

for per processor message rates 
Figure 17: Comparison of the interconnection schemes 


These experiments demonstrate that it is possible to use Cm* to 
emulate other multiprocessor architectures. The methods employed 
can easily be extended to emulate additional multiprocessors 
covering a large class of interconnection mechanisms as well. Though 
some of the overhead incurred in emulating the routing algorithm is 
due to the emulation software, this can be measured by comparison 
with nonemulated systems such as the basic experiment. Once the 
overhead is subtracted, detailed studies of comparative network 
performance are possible. It is expected that the Cm* testbed will 
prove useful for studying general multiprocessor architecture issues 
as well as designing specific multiprocessors for specific applications. 

SLOT-BASED MULTI-ACCESS PROTOCOL FOR LOCAL COMPUTER NETWORK 


A.I. Noor 
Carleton University 
ABSTRACT 


This paper presents a novel protocol for a 
Local Area Network with single channel communica- 
tion. Users are grouped according to their physi- 
cal location and each group is assigned a channel 
slot. Conflicts are resolved by an Assigned Slot 
Carrier Sense Multiple Access Mechanism with 
Collision Detection Capability (ASCD) protocol. 
The maximum throughput and average delays are 
evaluated. The results indicate that the perfor- 
mance of this protocol is better than many conten- 
tion type protocols widely used in Local Area Net- 
works. This protocol is best suited to transac- 
tion oriented messages, which are found in the 
real-time process control industry. 


1. INTRODUCTION 


Current trends in hardware encourage the 
abandonment of a single large computer in favour 
of a number of small machines. The resulting 
decentralization of computing power is, for many 
applications, a natural and obvious pattern. In 
these applications, the information itself is 
distributed in nature and is best managed by a 
network of machines. Thus, it has become very 
attractive to connect a number of microcomputers to 
form a resource sharing computer network. 
Reliability and throughput are improved. Small 
processors can be designed and implemented more 
quickly; thus it becomes practical to update to the 
latest and most cost-effective hardware technology. 


Applications such as plant and laboratory 
automation are made possible by computer network- 
ing. In these applications each microcomputer 
station monitors and controls an elementary part 
of the overall plant. Like any other communica- 
tion network, this network must have a communica- 
tion medium (channel). The microcomputer stations 
require an interface in order to transmit/receive 
messages using the channel. The need for a channel 
access protocol arises because the communication 
medium is shared by a number of stations. 
Conflicts which arise when more than one demand is 
simultaneously placed upon the channel give rise 
to multi-access protocols. A contention type 
protocol is attractive if each station independently 
decides when to transmit. Carrier Sense Multiple 
Access with Collision Detection (CSMA-CD) [1] is 
a popular contention protocol. Each user senses 
the channel and if the channel is idle starts its 
transmission. When a collision is detected the 
transmission ceases and the channel is jammed for 
a period (equivalent to 30 bits in Ethernet) [2] 
to make sure that all the users are aware of the 
collision. Colliding users each select a random 
backoff delay after which retransmission is at- 
tempted. Metcalfe and Boggs [2] proposed a binary 
exponential backoff algorithm. The performance of 
this algorithm and similar backoff algorithms has 
been investigated by various authors [3-6]. 


The main reasons for collisions are the absence 
of coordination among users and the long physical 
separation between them. One user does not know when 
another user starts message transmission. 


In this paper an alternative to the CSMA-CD 
protocol that eliminates the need for a collision 
enforcement mechanism is proposed. This protocol 
is designed to minimize the collision resolution 
time and hence maximize the channel throughput, 
by eliminating collision enforcement and the 
probability of repeated collision. 


In Section 2 of this paper the alternative 
protocol is described. Section 3 describes the 
simulation model. Sections 4 and 5 present the 
results of the simulation tests. Section 6 describes 
the design of the microcomputer network based on 
this protocol. 


2. ASSIGNED-SLOT CARRIER SENSE MULTIPLE ACCESS 
PROTOCOL WITH COLLISION DETECTION (ASCD) 


In this proposed protocol users monitor the 
communication channel and record its state. A 
user attempts transmission only when the channel 
is in the idle state. If a collision is detected, 
all but one user back off immediately; each 
unsuccessful user attempts retransmission when the 
channel becomes idle. 


Under this scheme the channel is divided into 
time frames each containing an equal number of 
slots. The users connected to the network are 
divided into a number of groups. The users are 
grouped according to their physical separation. 
The two furthest members of a group can be d 
kilometers apart, with 

    d = 2.5 c / f 

where c = velocity of light, 
      f = the operating frequency of the communication medium. 


Typically, f = 10° bits/sec and the distance, d, 
is less than 0.1 kilometers. If this distance is 
exceeded complete backoff is required. 


Each group is assigned a slot in a time 
frame. The members sense the channel and transmit 
in their assigned slot. Sensing and transmission are 
started at the slot boundary. There is still a 
chance that two users in the same group sense the 
channel as idle and start transmission at the same 
time. In this event the collision causes the users to 
back off and reschedule their transmission. 
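
The slot discipline described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `slot_outcome`, the `group_of` mapping, and the three outcome labels are assumptions made for illustration, and the backoff scheduling itself is left out.

```python
def slot_outcome(slot, ready_users, group_of):
    """Classify what happens at one slot boundary (illustrative only).

    slot        -- index of the slot that is starting (hypothetical numbering)
    ready_users -- users that currently have a packet to send
    group_of    -- dict mapping each user to the slot assigned to its group
    """
    # Only members of the group that owns this slot may start transmitting.
    contenders = [u for u in ready_users if group_of[u] == slot]
    if not contenders:
        return ("idle", [])
    if len(contenders) == 1:
        return ("success", contenders)
    # Two or more members of the same group started together: they collide
    # and must back off and reschedule.
    return ("collision", contenders)
```

With users 0 and 1 in the group that owns slot 0 and user 2 in the group that owns slot 1, slot 0 yields a collision between users 0 and 1 while slot 1 carries user 2's packet without conflict.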


Performance Measurement Criteria 


The foremost measures of the network's 
performance are (i) channel utilization, and (ii) average 
message delay. 


The channel utilization is the ratio of the 
time the network is successfully carrying packets 
to the time the network is busy. The average 
message delay is defined as the average interval 
between a user's desire to transmit and the 
successful reception of the packet by the destined 
user. 
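
As a small worked illustration of the two criteria, the following Python sketch computes them from recorded times. The function names and argument layout are assumptions chosen for this example; the paper does not give code for these measurements.

```python
def channel_utilization(successful_time, busy_time):
    """Ratio of time spent carrying packets successfully to total busy time."""
    if busy_time <= 0:
        raise ValueError("busy_time must be positive")
    return successful_time / busy_time


def average_message_delay(request_times, reception_times):
    """Mean interval between wanting to transmit and successful reception.

    The two lists are assumed to be aligned: entry k of each list refers
    to the same message.
    """
    delays = [rx - rq for rq, rx in zip(request_times, reception_times)]
    return sum(delays) / len(delays)
```

For instance, a channel that is busy for 50 time units and carries packets successfully for 46 of them has a utilization of 0.92.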


3. SIMULATION MODEL 


A simulation program, to determine the per- 
formance of the protocol, was written in SIMULA 
[7] and used the report generating facilities of 
DEMOS [8]. The simulation program is based on the 
following assumptions: 


- Poisson message arrival. 
- The packet lengths are uniformly distributed. 
- The packet has a maximum length of 1024 bits. 
- A user either transmits or receives. 
- The channel is assumed noiseless. 
- Colliding users back off for a random time. 
- A user can migrate to a slot without any delay. 
- Collisions are detected within one bit. 

The parameter values used in the simulation 
are given in Table I. 

TABLE I 

PARAMETER                  VALUE 
Speed of the channel       8 million bits per second 
Maximum packet length      1024 bits 
Minimum packet length      256 bits 
Number of users            63 
Number of slots            8 
Length of the slot         8 bits 
Maximum number of tries    32 

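
The first two simulation assumptions (Poisson arrivals, uniformly distributed packet lengths) can be sketched in Python as follows. The original program was written in SIMULA with DEMOS; this fragment is only an illustrative stand-in, and the function name, the seed, and the arrival rate used in the usage note are assumptions. The 256 and 1024 bit bounds are the packet length limits listed in Table I.

```python
import random


def generate_workload(num_packets, arrival_rate,
                      min_bits=256, max_bits=1024, seed=0):
    """Return a list of (arrival_time, packet_length_bits) pairs.

    Inter-arrival times are exponential with the given rate, which gives
    a Poisson arrival process; lengths are uniform on [min_bits, max_bits].
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    t = 0.0
    workload = []
    for _ in range(num_packets):
        t += rng.expovariate(arrival_rate)
        workload.append((t, rng.randint(min_bits, max_bits)))
    return workload
```

A call such as `generate_workload(100, 1000.0)` produces 100 packets at an average rate of 1000 packets per second.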
4. SIMULATION RESULTS 


Several observations are made about the ASCD 
protocol. Backoff probability plays an important 
role. Figure 1 shows that channel utilization is 
improved with higher sensing probability, ρ. As ρ 
approaches 0, the delay gets arbitrarily large 
(due to the large retransmission delays). For 
ρ > 0.1 the relative change in performance becomes 
small. 


Average delays are shown in Figure 2. The 
knee moves to the right as the packet size 
increases, indicating higher utilization. The 
variable packet has an average delay of less than 0.5 
milliseconds for average utilization of up to 0.92. 


In the real-time environment the average load 
is not high, but faster consistent response is re- 
quired. Figure 3 shows both the mean and the 
standard deviation of the delay due to uniformly 
distributed packet size. This indicates that the 
protocol can perform well without the loss of 
stability. 


5. PERFORMANCE SUMMARY 


A number of observations are made in this 
analysis: 


- The protocol is stable and utilization is a 
  decreasing function of load. 
- The variance in response time is small. 
- The slot switching mechanism improves the 
  channel utilization and reduces the average delay. 
- This protocol is better than other 
  contention-based protocols, because of its shorter response 
  time and increased effective transmission rate. 


Figure 4 compares the delay-load relation- 
ship of the ASCD protocol for various packet 
sizes. For $8 = 64 and 128 there is an improvement 
in performance over other protocols in the 104 to 
10° packets/sec average arrival rate range. For 


8 = 512 and 1024 there is an improvement through- 
out the whole average arrival rate range due to 
the effect of slot switching. 


Figure 5 describes the channel utilization 
for the ASCD over the arrival rate range. This 
protocol has improved response and stability. 


6. THE DESIGN OF THE PROTOCOL HARDWARE 


The goal is to design and build an economi- 
cal, efficient network to interconnect user sta- 
tions. Each station is likely to have an 8/16 
bit microcomputer, mass storage or peripheral 
controller. Our particular application is the 
real-time operation of an Electric Substation in 
which certain messages must have priority. These 
characteristics are found in other real-time pro- 
cess control applications which require a hierar- 
chical or super user relationship between stations. 


The node has two parts. The first part is 
the user module which is the plant processor part 
of each station. The second part or Universal 
Interface Module, (UIB) executes the communication 
protocol. The UIB is built around an 8-bit micro- 
computer with special hardware to form a message 
from a packet. The UIB formats the message from 
the user's module into a packet with destination 
address(es) and information identifiers. The mes- 
sage part is submitted directly from the memory of 
UIB-processor. The remainder of the packet frame 
and the collision control is provided by the 
communication medium interface. An 8-bit shift 
register identifies its time-slot and also 
identifies adjacent time-slots for possible migration. 


The UIB hardware is based on an Intel 8085 
processor with 2K of RAM and 4K of ROM, 
memory-mapping logic, a programmable interrupt controller, and a 
programmable interrupt timer. The complete 
hardware portion uses about 100 ICs. 


CONCLUSION 


This protocol has been developed for the 
real-time application where messages are mostly 
transaction oriented. Advantages as confirmed 
through simulation are: 


- shorter response times 
- increased effective channel utilization 
- message priority discussion 


The cost of the UIB is anticipated to be reason- 
able since it has been implemented with about 100 
IC's and a microcomputer. The general constraints 
are limited physical distance between stations and 
short messages. 

NEW CONNECTIVITY AND MSF ALGORITHMS FOR ULTRACOMPUTER AND PRAM 

ABSTRACT 


Parallel ULTRACOMPUTER algorithms for finding 
the connected components and a minimum spanning 
forest (MSF) of an undirected graph are presented. 
Both have depth of O((log n)^2) where n is the 
number of vertices in the graph and n^2 processors 
are used. Both algorithms are implementations of 
PRAM algorithms that are presented first. 


The connectivity PRAM algorithm is a simplifi- 
cation of the one appearing in [10]. A modifica- 
tion of this algorithm yields a simple and effi- 
cient MSF algorithm. Both have depth of O(log n) 
and they use 2E processors, where E is the 
number of edges in the graph. 


1. INTRODUCTION 


The ULTRACOMPUTER (later denoted as UC) and 
the PRAM are two models of parallel computation. 
The first consists of a set of processors, each 
having a local memory and random-access capabil- 
ities. They communicate via a shuffle-exchange 
network. A detailed description of this model and 
several basic algorithms in it, can be found in 
[6] and [7]. In the PRAM model the processors 
share a common memory and each of them has access 
to each memory cell. Variants of this model differ 
in the capability of performing concurrent READs 
and/or WRITEs from the same memory cell. (See [9].) 


In this paper we describe two UC algorithms, 
for computing connected components and minimum 
spanning forest in an undirected graph. The first 
problem has an O((log n)4) solution using n+E 
processors. This algorithm is presented in [7] 
following a PRAM connectivity algorithm of [3], 
having depth of O((log n)#) UC steps. A faster 


PRAM algorithm of O(log n) depth is given in [10]. 
This algorithm and its proof are further simplified 
in this paper. A new technique of PRAM simulation 
by UC realizes this algorithm on a UC in 
O((log n)^2) time using n^2 processors. Similar 
ideas lead to PRAM and UC MSF algorithms of the 
same depth and number of processors as in the 
connectivity algorithms. 


Existing PRAM MSF algorithms have O((log n)^2) 
depth ([2] and [9]). These algorithms can be im- 
plemented in a UC in O((log n)*) time using ideas 
similar to those described in [7]. No faster UC 
MSF algorithm is known to us. 


2. A SIMPLE PRAM CONNECTIVITY ALGORITHM 


Our connectivity algorithm is a modification 
of that appearing in [10], with a much simpler 
proof of logarithmic depth. It deals with exactly 
the same model as in [10], i.e. simultaneous 
reading and writing are allowed. In the latter case 
one processor succeeds but we don't care which. 
Two processors P(i,j) and P(j,i) are assigned to 
each edge (i,j). 


The notions of 'rooted tree', 'rooted star', 
'pointer graph', 'shortcutting' and 'hooking' are 
the same as in [10]. The father and grandfather 
(= father of father) of a vertex i in the pointer 
graph are denoted by D(i) and G(i) respectively. 
Each step of the algorithm below is executed by 
each processor P(i,j). 


The Algorithm. 


Step 1: Conditional hooking (similar to Step 2 in 
        [10]). If G(i) = D(i) and D(i) > D(j) 
        then D(D(i)) ← D(j). 


Step 2: Unconditional star hooking. 
        If i belongs to a star and D(i) ≠ D(j) 
        then D(D(i)) ← D(j). 


Step 3: If i belongs to a star then STOP. 
        Else D(i) ← G(i) (Shortcutting). 


The algorithm loops on the three steps above 
until Step 3 does not produce any non-trivial 
shortcuts. 


COMMENTS : 


1. In order to make the first iteration look like 
its successors, the input graph is slightly 
modified by connecting a dummy vertex i' to i 
for all i=1,...,n. Moreover, the pointer 
graph is initialized by: D(i') = D(i) = i. 


2. The condition 'i belongs to a star' is checked 
by a simple STARCHECK procedure. A field ST(i) 
is attached to each vertex i, indicating whether 
it belongs to a star or not. STARCHECK is based 
on the observation that a vertex i does not 
belong to a star iff it has a nontrivial grand- 
father or grandson or nephew (= grandson of 
father). 


STARCHECK 
    ST(i) ← TRUE 
    IF D(i) ≠ G(i) then ST(i), ST(G(i)) ← FALSE 
    ST(i) ← ST(G(i)). 
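
To make the three steps and STARCHECK concrete, here is a sequential Python simulation of one possible reading of the algorithm. It is a sketch under stated assumptions, not the authors' implementation: vertices are numbered from 0, the dummy vertex i' is stored at index n+i, each parallel step reads from a snapshot of D taken before the step, concurrent writes are resolved by letting the last writer in loop order win (the model only requires that some writer succeeds), and the `max_rounds` guard is added purely as a safety net for the simulation.

```python
def star_check(D):
    """Return ST, where ST[i] is True iff vertex i belongs to a star."""
    n = len(D)
    G = [D[D[i]] for i in range(n)]          # grandfather pointers
    ST = [True] * n
    for i in range(n):
        if D[i] != G[i]:                     # nontrivial grandfather
            ST[i] = False
            ST[G[i]] = False
    return [ST[G[i]] for i in range(n)]      # ST(i) <- ST(G(i))


def connected_components(n, edges, max_rounds=None):
    """Label each of the n vertices with the root of its final star."""
    D = list(range(n)) + list(range(n))      # D(i') = D(i) = i
    procs = []                               # one "processor" per directed edge
    for u, v in edges:
        procs += [(u, v), (v, u)]
    for i in range(n):                       # dummy edges (i, i')
        procs += [(i, n + i), (n + i, i)]
    if max_rounds is None:
        max_rounds = 4 * len(D) + 8          # generous cap, an assumption
    for _ in range(max_rounds):
        # Step 1: conditional hooking.
        old = D[:]
        for i, j in procs:
            if old[old[i]] == old[i] and old[i] > old[j]:
                D[old[i]] = old[j]
        # Step 2: unconditional star hooking.
        old = D[:]
        ST = star_check(old)
        for i, j in procs:
            if ST[i] and old[i] != old[j]:
                D[old[i]] = old[j]
        # Step 3: shortcutting; stop when no non-trivial shortcut occurs.
        short = [D[D[i]] for i in range(len(D))]
        if short == D:
            return D[:n]
        D = short
    raise RuntimeError("no fixed point reached within max_rounds")
```

On the path 0-1-2 with an isolated vertex 3, `connected_components(4, [(0, 1), (1, 2)])` labels the first three vertices with root 0 and the last with root 3.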
Theorem 1 (partial correctness): 
If the algorithm terminates then 
a. The pointer graph consists entirely of stars. 


b. The vertices of each star form a connected 
component of the graph. 


Proof: 


a. Follows immediately from the termination condi- 
tion in Step 3. 


b. It is easy to see that the partition defined 
by the pointer graph is always a sub-partition 
of that defined by the connected components. 
Moreover, a proper subset of a connected compon- 
ent cannot form a star in the pointer graph at 
the beginning of Step 3 since stars are un- 
conditionally hooked in Step 2. 


Theorem 2 (logarithmic depth): 
a. Stars are not hooked on stars in Step 2. 


b. The pointer graph is always a forest of rooted 
trees. 


c. If C is a connected component of G_ then the 
sum of the heights of the trees in the pointer 
graph consisting of vertices of C decreases 
by a factor of 3/2, at least, after each applica- 
tion of the loop, until these trees form a star. 


Proof: 


a. A star at the beginning of Step 2 has not been 
changed in Step 1 since the hooking in Step 1 
produces trees of height two at least, (a result 
of the initialization). At the beginning of 
Step 2 there exists no edge whose endpoints 
belong to different stars. Otherwise one of 
the edge processors should have attempted to 
hook the larger root on the smaller one at Step 
1 making it a non-root at the beginning of 
Step 2. 


b. The statement is true initially. Steps 1 and 3 
never produce any directed cycle. By a. stars 
are hooked only on non-stars in Step 2. Since 
non-stars are not further hooked on anything 
in that step, cycles are not formed in Step 2 
either. 


c. As a result of the initialization step, all 
the trees are of height one at least and there- 
fore hooking is never done on leaves. Thus, 
hooking one tree on another yields a tree of 
height not greater than the sum of the heights 
of the original trees. Therefore, Steps 1 and 
2 do not increase the sum of heights of any 
set of trees. Let C be a connected component 
of G. If the set of trees in the pointer graph 
consisting of the vertices of C, has not shrunk 
yet to a single star, then at the beginning of 
Step 3 none of them is a star. Hence, at Step 3 
their sum of heights decreases by a factor of 
3/2 at least. 


3. A PRAM MSF ALGORITHM 


As in [8], our MSF algorithm is also an 
implementation of Sollin's algorithm as presented 
in [1]. Our improvement of their result is similar 
to the improvement of [10] with respect to the 
connectivity algorithm of [4]. The depth and 
number of processors are the same as in the connect- 
ivity algorithm above. 


This algorithm, however, assumes a stronger 
model of computation in which the lowest-numbered processor 
writes in case of concurrent writing to the same 
location. As in the connectivity algorithm above, 
two processors P(i,j) and P(j,i) are assigned 
to each edge (i,j) of the graph. The algorithm 
needs a preprocessing assignment phase in which 
processors are assigned to edges in such a way 
that processors with lower identity numbers 
correspond to edges with lower lengths. In the worst 
case, sorting of the edges with respect to their 
lengths might be needed, which takes O(log n) or 
O((log n)^2) time according to the number of 
available processors (see [5] and [9]). 


Previous results on this problem ([2] and 
[8]) achieved depth of (log n)4 with n2 and 
(n2)/( (log n)) processors respectively. Both 
works assume the 'concurrent READ exclusive WRITE' 
PRAM model. 


The following variables are used. 


D(i) denotes the father of i in the pointer graph. 

TREE (i,j) is a Boolean variable corresponding to 
P(i,j). By the end of the algorithm an edge 
(i,j) belongs to the minimal spanning forest 
iff TREE(i,j) = 1 or TREE(j,i) = 1. 

ATTEMPT (i) is an auxiliary variable corresponding 
to vertex i. 

L(i,j) is the length of Edge (i,j). 


MSF ALGORITHM 


Initialization: TREE(i,j) ← 0, D(i) ← i, 
                ATTEMPT(i) ← NIL for all i,j. 


Step 1: Star Hooking 
        If i belongs to a star and D(i) ≠ D(j) 
        Then ATTEMPT(D(i)) ← (i,j) 
        If (i,j) = ATTEMPT(D(i)) 
        Then TREE(i,j) ← 1 and D(D(i)) ← D(j) 


Step 2: Tie Breaking 
        If i < D(i) and i = G(i) 
        Then D(i) ← i 


Step 3: If i belongs to a star then STOP. 
        Else D(i) ← D(D(i)) 


The algorithm loops on these three steps until 
Step 3 does not produce any real shortcut. 


COMMENT : 


All P(i,j) that want to hook D(i) try 
competitively to write their identities into ATTEMPT 
(D(i)), where the winner's identity is eventually 
stored. Only the winner succeeds in hooking and 
its edge is inserted into FOREST, which is the set of 
all edges (i,j) for which either TREE(i,j) or 
TREE(j,i) is 1. 
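
The write rule assumed by this algorithm (the lowest-numbered processor wins a concurrent write) can be sketched sequentially in Python. The function name and the request format are assumptions made for this illustration only.

```python
def resolve_concurrent_writes(requests):
    """Resolve simultaneous writes so that the lowest processor id wins.

    requests -- iterable of (processor_id, cell, value) triples, all
                issued in the same parallel step.
    Returns a dict mapping each written cell to the winning value.
    """
    winners = {}
    for proc, cell, value in requests:
        # Keep the request from the smallest processor id seen so far.
        if cell not in winners or proc < winners[cell][0]:
            winners[cell] = (proc, value)
    return {cell: value for cell, (proc, value) in winners.items()}
```

Because processors are numbered in order of edge length, the value that survives in ATTEMPT(D(i)) under this rule is the shortest edge leaving the star.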


In order to prove the correctness and logarith- 
mic depth of the algorithm the following lemmas are 
needed: 


Lemma 1: At each step of the algorithm: 
a. FOREST is a subset of an MSF. 


b. The connected components of the pointer graph 
are identical to the connected components of 
FOREST and constitute a partition of the con- 
nected components of the graph. 


c. The pointer graph is a directed forest with 


the exception of self-loops at roots and cycles 
of length 2 that exist only between Steps 1 
and 2. 


Proof: 


a. It can be assumed that edge lengths are all 
different. This is achieved by taking the 
length of an edge (i,j) as the triple L(i,j) = 
(length of Edge (i,j), min (i,j), max (i,j)) 
and considering the lexicographic order. This 
yields a unique MSF. Whenever TREE (i,j) + 1, 
the edge (i,j) is the shortest edge emanating 
from the set of vertices determined by the star 
rooted at D(i). Thus (i,j) belongs to the 
MSF, and therefore FOREST is always a subset of 
the MSF. 


b. This statement can be proved inductively since 
whenever two connected components of the pointer 
graph are connected, so are the corresponding 
components of FOREST and the latter ones are by 
definition a partition of the connected compon- 
ents of the graph. 


c. Assume that the statement is true at the begin- 
ning of the current iteration. Consider an 
application of Step 1. Any new edge added at 
this step emanates from a star root to a non- 
leaf vertex in another tree. Suppose that a 
directed cycle C was created by Step 1. 
Since C must contain new edges, and since an 
old edge cannot be followed in C by a new edge, 
C must contain only new edges stitching the 
star roots. Thus if C contains a new cycle 
of length > 2 so does FOREST, contradicting a. 
Cycles of length 2 are opened at Step 2. The 
proof is completed by observing that Step 3 
does not introduce any new cycles to the pointer 
graph. 
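
The device used in part a of the proof, replacing each edge length by the triple (length, min(i,j), max(i,j)) and comparing triples lexicographically, maps directly onto tuple comparison in Python. The sketch below is illustrative; the name `edge_key` and the (i, j, length) edge format are assumptions.

```python
def edge_key(i, j, length):
    """Distinct, totally ordered key for the undirected edge (i, j)."""
    return (length, min(i, j), max(i, j))


def sort_edges(edges):
    """Sort (i, j, length) triples by their lexicographic edge key."""
    return sorted(edges, key=lambda e: edge_key(*e))
```

Two edges of equal length are then ordered by their endpoints, so no two different edges compare equal, and both orientations of one edge share the same key.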


Theorem: 


Upon termination FOREST is the MSF of the graph. 


Proof: 


It is easy to see that by the end of the algo- 
rithm the connected components of the graph coincide 
with those of the pointer graph. Thus by statement 
b of Lemma 1 FOREST spans the graph and by statement a it is an 
MSF. 


Theorem: 


The algorithm terminates after at most log n 
iterations. 


Proof: 


The proof is similar to that of the correspond- 
ing statement in the connectivity algorithm and 
therefore omitted. 


4. SUPERCOMPUTER IMPLEMENTATION OF THE 
CONNECTIVITY AND MSF ALGORITHMS 


The SUPERCOMPUTER (later denoted as SC) is an 
intermediate level between the PRAM and the UC. It 
is a UC with a full communication network enabling 
direct communication between any pair of processors. 
Thus, each processor can read and write into each 
other processor's memory. However, concurrent 
READs and WRITEs from/to the same processor can be 
executed only if one memory cell is accessed. In 
case of WRITE the minimal value is written. 


In this section we implement the PRAM 
algorithms above by an SC with n+2E processors: P(i,j) 
and P(j,i) for each edge (i,j) and P(i,i) for 
i=1,...,n. In the next section, the SC algorithms 
will be further implemented by a UC. 


Both SC algorithms use three basic instruc- 
tions READ, WRITE and TRANSPOSE. 


READ has the form: READ (OLDFIELD from P(ADR,ADR) 
into NEWFIELD). The exact effect of this 
statement is that P(i,j) reads OLDFIELD in 
P(ADR,ADR), and copies it into NEWFIELD which 
is a new field of its own. (ADR itself is 
a field in P(i,j)). 


WRITE has the form: WRITE (OLDFIELD at P(ADR,ADR) 
into NEWFIELD) which means that P(i,j) writes 
the value stored in OLDFIELD into the new 
field NEWFIELD at P(ADR,ADR). 


TRANSPOSE has the form: TRANSPOSE (OLDFIELD to 
NEWFIELD) and it transfers the value stored 
at P(i,j) in OLDFIELD to P(j,i) into NEWFIELD. 


Each processor P(i,j) stores the following 
variables in the fields: 


Variables: D(i)  D(j)  G(i)  ST(i)  i  ATTEMPT(i)  L(i,j) 
Fields:    D1    D2    G     ST     I  ATTEMPT     LENGTH 


THE CONNECTIVITY SC-ALGORITHM 


Step 1: READ (D1 from P(I,I) into D1) 
        TRANSPOSE (D1 to D2) 
        READ (D1 from P(D1,D1) into G) 
        If G=D1 and D1>D2 
        Then WRITE (D2 at P(D1,D1) into D1) 

Step 2: STARCHECK /* See routine below */ 
        TRANSPOSE (D1 to D2) 
        If ST=TRUE and D1≠D2 
        Then WRITE (D2 at P(D1,D1) into D1) 

Step 3: SHORTCUT 
        If ST=TRUE then STOP 
        Else READ (D1 from P(I,I) into D1) 
             READ (D1 from P(D1,D1) into D1) 

STARCHECK 
        READ (D1 from P(I,I) into D1) 
        READ (D1 from P(D1,D1) into G) 
        ST ← TRUE 
        If G≠D1 
        Then ST ← FALSE 
             WRITE (ST at P(G,G) into ST) 
        READ (ST from P(G,G) into ST) 


THE MSF SC-ALGORITHM 


Step 1: STARCHECK 
        TRANSPOSE (D1 to D2) 
        If D1≠D2 and ST=TRUE 
        Then WRITE (LENGTH at P(D1,D1) into ATTEMPT) 
        READ (ATTEMPT from P(D1,D1) into ATTEMPT) 
        If LENGTH = ATTEMPT 
        Then WRITE (D2 at P(D1,D1) into D1) 
             TREE ← 1 

Step 2: READ (D1 from P(I,I) into D1) 
        READ (D1 from P(D1,D1) into G) 
        If G=I and D1>I then D1 ← I 

Step 3: SHORTCUT 
        If ST=TRUE then STOP. 
        Else READ (D1 from P(I,I) into D1) 
             READ (D1 from P(D1,D1) into D1) 


5. UC IMPLEMENTATION OF SC INSTRUCTIONS READ, 
WRITE AND TRANSPOSE 


In this section the three SC operations above 
are simulated by an N-UC with N=n^2. Each operation is 
simulated in O(log n) steps, yielding a total depth of 
O((log n)^2) for both algorithms. To each ordered 
pair of vertices <i,j>, a processor P(i,j) is 
attached regardless of whether (i,j) is an edge of G 
or not. A single bit BIT stored in the local memory 
of each processor indicates whether P(i,j) is an 
edge-processor or not. The edge-processors of UC 
simulate the corresponding SC processors while the 
non-edge UC processors serve for communication 
purposes. Only edge-processors participate in 
WRITEs. As usual for UC, n is assumed to be equal 
to 2^k for some integer k (thus N=2^(2k)). The 
simulation is further subdivided into higher and 
lower level implementations. 


5.1 The Higher Level 


In this level READ, WRITE and TRANSPOSE are 
implemented by lower level routines BROADCAST, 
CHOOSE and TRANS. These low-level routines are 
implemented in the next subsection by the basic UC 
routine GROUPSUM. 


CHOOSE has the form: CHOOSE (OLDFIELD into NEWFIELD). 
It chooses a minimal value among all non-NIL values 
stored at OLDFIELD of P(i,j) for fixed i and all 
j, and inserts it into NEWFIELD at P(i,i). If the 
values in OLDFIELD are NIL for all P(i,j), 1≤j≤n, 
then the value of NEWFIELD at P(i,i) does 
not change. 


BROADCAST has the form: BROADCAST (OLDFIELD into 
NEWFIELD). It inserts the value stored at OLDFIELD 
of P(i,i) into NEWFIELD of P(i,j) for j=1,...,n. 


TRANS has the form: TRANS (OLDFIELD to NEWFIELD). 
It inserts the value stored at OLDFIELD of P(j,i) 
into NEWFIELD of P(i,j). 


Let FO, FN and AD denote OLDFIELD, NEWFIELD and 
ADDRESS respectively. 


READ (FO from P(AD,AD) into FN): 
    BROADCAST (FO into F1) 
    TRANS (F1 to F2) 
    If j ≠ AD 
    Then F2 ← NIL 
    CHOOSE (F2 into F3) 
    BROADCAST (F3 into FN) 


WRITE (FO at P(AD,AD) into FN): 
    If BIT = 0 then FO ← NIL 
    CHOOSE (FO into F1) 
    BROADCAST (F1 into F2) 
    If j ≠ AD 
    Then F2 ← NIL 
    TRANS (F2 to F3) 
    CHOOSE (F3 into FN). 


TRANSPOSE (FO to FN): 
    TRANS (FO to FN) 
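
The composition of READ from BROADCAST, TRANS and CHOOSE can be checked with a small sequential Python model. This is only a sketch under several assumptions: each field is represented as an n x n list indexed from 0, `None` stands for NIL, the address field AD is passed in as its own n x n list, and the helper names are invented for this example.

```python
NIL = None


def broadcast(old):
    """NEWFIELD of P(i,j) receives OLDFIELD of P(i,i)."""
    n = len(old)
    return [[old[i][i] for _ in range(n)] for i in range(n)]


def trans(old):
    """NEWFIELD of P(i,j) receives OLDFIELD of P(j,i)."""
    n = len(old)
    return [[old[j][i] for j in range(n)] for i in range(n)]


def choose(old, new):
    """P(i,i) receives the minimal non-NIL value of row i, if any."""
    result = [row[:] for row in new]
    for i, row in enumerate(old):
        values = [v for v in row if v is not NIL]
        if values:                      # otherwise NEWFIELD is unchanged
            result[i][i] = min(values)
    return result


def read(fo, ad):
    """Each P(i,j) obtains field FO of P(AD,AD), following the READ routine."""
    n = len(fo)
    f1 = broadcast(fo)
    f2 = trans(f1)
    # "If j != AD then F2 <- NIL": keep only the column named by AD.
    f2 = [[f2[i][j] if j == ad[i][j] else NIL for j in range(n)]
          for i in range(n)]
    f3 = choose(f2, [[NIL] * n for _ in range(n)])
    return broadcast(f3)
```

With diagonal values 10, 20, 30 and every processor in rows 0, 1, 2 addressing P(2,2), P(0,0), P(1,1) respectively, the rows of the result hold 30, 10 and 20.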


5.2 The Lower Level 


Let '+' denote any associative binary operation 
and let A=a_1,...,a_N be any N-vector where a_i is 
stored in P_i for all i. The output of (LEFT) SUM is 
the vector of partial 'sums' a_1, a_1+a_2, a_1+a_2+a_3, ..., 
a_1+...+a_N such that a_1+...+a_i is stored in P_i for 
all i. The LEFT GROUPSUM (LGS) operation is similar 
to SUM. The difference is that here the vector a_1,..., 
a_N is divided into several segments and the result 
is that of SUM operating on each segment separately. 
When the summation starts from the rightmost element 
of each segment we obtain the RIGHT GROUPSUM (RGS). 

GROUPSUM of a vector of length N can be done in 
log N time by an N-UC as shown in SCH-80. 
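
What LGS and RGS compute can be shown with a sequential Python sketch. It says nothing about the log N parallel realization; it only fixes the input and output behaviour. The function names are assumptions, and for simplicity all segments are assumed to have the same length `seg_len`, which is the case for the row segments used in this paper.

```python
def left_groupsum(values, seg_len, op):
    """Prefix 'sums' under op, restarted at the left end of each segment."""
    out = []
    for k, v in enumerate(values):
        if k % seg_len == 0:
            acc = v                 # first element of a new segment
        else:
            acc = op(acc, v)
        out.append(acc)
    return out


def right_groupsum(values, seg_len, op):
    """Same as left_groupsum, but summing from the right end of each segment."""
    out = [None] * len(values)
    for k in range(len(values) - 1, -1, -1):
        if k % seg_len == seg_len - 1:
            acc = values[k]         # last element of a segment
        else:
            acc = op(values[k], acc)
        out[k] = acc
    return out
```

With `op=min` and segments of length 3, the vector 3, 1, 2, 5, 4, 0 gives 3, 1, 1, 5, 4, 0 for LGS and 1, 1, 2, 0, 0, 0 for RGS.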


Realization of CHOOSE and BROADCAST by GROUPSUM. 


In the following realizations, our GROUPSUM 
segments consist of all P(i,j) for a fixed i. Our 
associative binary operation is defined by: 
A+B=min(A,B) where NIL is considered as bigger 
than any non-NIL value. 


CHOOSE (FO into FN): 
    F1 ← LGS (FO) 


/* At this point the field F1 in P(i,n) contains 
a minimal non-NIL element if such exists in its 
segment and NIL otherwise. In the first case this 
non-NIL element is inserted to FN at P(i,i). */ 


    If P(i,j) = P(i,n) and F1≠NIL 
    Then FN ← RGS (F1) 


BROADCAST (FO into FN): 
    If P(i,j) ≠ P(i,i) 
    Then FN ← NIL 
    Else FN ← FO 
    LGS (FN) 
    If P(i,j) ≠ P(i,n) 
    Then FN ← NIL 
    RGS (FN) 


TRANS (FO to FN): 
    FN ← FO 
    DO log n times: 
        FN ← SHUFFLE (FN) 


/* SHUFFLE is the basic UC operation defined by: 

    SHUFFLE(i) = 2i        if i < N/2 
                 2i-N+1    else 

TRANS is exactly the matrix transposition algorithm 
appearing in [6]. */ 
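
Why log n shuffles transpose the processor grid can be verified with a short Python sketch. It assumes that P(i,j) has the linear index i*n+j with indices starting at 0 and that n is a power of two; under that assumption each shuffle rotates the 2k-bit index left by one position, so k = log n shuffles exchange the i and j halves of the index. The function names are illustrative.

```python
def shuffle(p, N):
    """Perfect-shuffle destination of processor index p among N processors."""
    return 2 * p if p < N // 2 else 2 * p - N + 1


def trans(values, n):
    """Move the value held by P(i,j) to P(j,i) using log2(n) shuffles."""
    N = n * n
    k = n.bit_length() - 1
    if 1 << k != n:
        raise ValueError("n must be a power of two")
    cur = list(values)
    for _ in range(k):
        nxt = [None] * N
        for p in range(N):
            nxt[shuffle(p, N)] = cur[p]   # each value travels along the shuffle link
        cur = nxt
    return cur
```

For n = 4, placing the pair (i, j) at index i*4+j and calling `trans` leaves (i, j) at index j*4+i.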
Bridge-connectivity and Biconnectivity Algorithms 
for Parallel Computer Models 

Abstract 


New algorithms for the bridge-connectivity and 
biconnectivity problems are presented. It is shown that these 
algorithms can be implemented on any parallel computer 
model on which an ordinary matrix multiplication algorithm 
exists and that the hardware resources required (in terms of 
the number of processors and the chip area) are no more 
than those required by the matrix multiplication algorithm. 
The time required is at most a factor of max(log d, log d')+1, 
1≤d,d'≤n, greater than that needed by the matrix 
multiplication algorithm. Since most of the existing parallel 
computer models have efficient ordinary matrix 
multiplication algorithms, the algorithms presented here turn 
out to be very efficient. 


1. Introduction 

The result of this paper is motivated by the following 
observations. 

1. For most of the existing parallel computer models, there 
exist few algorithms for graph theoretic problems. This is 
especially true for those models which obey the VLSI design 
constraints [8,9,14]. On the other hand, an important basic 
operation, namely matrix multiplication, has an efficient 
algorithm on almost every parallel computer model [3,8,9,11]. 
Since graphs are usually stored in the form of matrices in 
computers, it is conceivable that matrix multiplication or the 
techniques it uses may be useful in developing efficient 
algorithms for graph theoretic problems. As a matter of fact, 
Dekel, Nassimi and Sahni have exploited this idea [3]. 


2. It usually happens that whenever an algorithm is presented, 
it is designed with a particular model in mind, and its 
complexity analysis is provided for that model only. Extra 
effort has to be made in order to carry it over to the other 
models. A typical example is Hirschberg's graph-connectivity 
algorithm, which was originally designed for the SIMD-SM-R 
model [2,4]. It was then implemented on the SIMD-MCC 
model by Nassimi and Sahni [6]; on a SIMD-SM-RW by 
Shiloach and Vishkin [13,17]; and finally on the PSN, OTN, and 
OTC models by Nath, Maheshwari and Bhatt [7,8]. It would be 
convenient if the complexity analysis of an algorithm could 
be given in such a way that it would be valid for any model 
satisfying certain reasonably moderate conditions. 


In this paper, we shall design parallel algorithms for 
the bridge-connectivity and biconnectivity problems. We 
then analyze their complexities for any parallel computer 
model on which an ordinary matrix multiplication algorithm 
exists. Since almost every existing model has an efficient 
algorithm for matrix multiplication, the condition imposed is 
not severe. Let $O(t(n))$ and $H(n)$ denote the time and hardware 
resources (in terms of number of processors and chip area) 
required by the ordinary $n \times n$ matrix multiplication algorithm. 
We will show that the time and hardware complexities of 
our algorithms are bounded above by 
$O(t(n) \cdot (\max(\log d, \log d'') + 1))$, $1 \le d, d'' \le n$, and $H(n)$ respectively 
on those models. 


2. Basic Definitions 
The definitions of the graph theoretic terms used in this 
paper are standard and can be found in various texts. 
Definition: Let $T(V,E')$ be a directed tree and $u, v \in V$. 
    $u \preceq v$ if $u$ is an ancestor of $v$ in $T$. 
    $u \prec v$ if $u$ is a proper ancestor of $v$ in $T$. 
Definition: The lowest common ancestor, $LCA(u,v)$, of two 
vertices $u$, $v$ in $T$ is defined as: 
    $LCA(u,v) = (\max_{\preceq})\{w \mid w \preceq u \text{ and } w \preceq v\}$. 
Definition: Let $v \in V$. The level of $v$ in $T$ is defined as: 

    $level(r) = 1$, 

    $level(v) = level(F(v)) + 1$, $v \ne r$, 

where $r$ is the root of $T$ and $F(v)$ is the father of $v$ in $T$. 

Note that $level(v) > 0$ $\forall v \in V$. 

We denote $level(LCA(u,v))$ by $lLCA(u,v)$. 
Definition: Let $v \in V$ and let $T$ be a directed spanning tree of an 
undirected graph $G(V,E)$. 

    $HLCA(v) = (\min_{\preceq})(\{LCA(v,w) \mid (v,w) \text{ is in } G-T\} \cup \{v\})$. 

    $lHLCA(v) = level(HLCA(v))$. 

    $\alpha(v) = \min\{lHLCA(w) \mid v \preceq w\}$. 


3. Outline of the Algorithms 
Algorithm: Bridge-connectivity 

1. Find a directed spanning tree $T(V,E')$ of $G(V,E)$; 
2. Compute $lLCA(u,v)$ $\forall (u,v) \in V \times V$; 

3. Compute $lHLCA(v)$ $\forall v \in V$; 

4. Compute $\alpha(v)$ $\forall v \in V$; 

5. Test if $level(v) \le \alpha(v)$ $\forall (u,v) \in E'$. (* $(u,v)$ is a bridge iff 
$level(v) \le \alpha(v)$ *) 

6. Delete all the bridges in $G$ and find the connected 
components of the resulting graph. (* These are the 
bridge-connected components of $G$ *) 
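
Steps 1 to 5 of Algorithm Bridge-connectivity can be traced with a small sequential sketch. It assumes a connected simple graph on vertices 0..n-1 given as an edge list, and it evaluates the definitions of Section 2 by brute force rather than by matrix multiplication; the function name `bridges` and all helper names are illustrative only.

```python
from collections import deque


def bridges(n, edges, root=0):
    """Return the bridges of a connected simple graph as sorted vertex pairs."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Step 1: BFS spanning tree; father[v] plays the role of F(v), level(root) = 1.
    father = [None] * n
    level = [0] * n
    level[root] = 1
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in sorted(adj[u]):
            if level[w] == 0:
                level[w] = level[u] + 1
                father[w] = u
                queue.append(w)

    def ancestors(v):
        """All ancestors of v (v itself included) up to the root."""
        result = set()
        while v is not None:
            result.add(v)
            v = father[v]
        return result

    anc = [ancestors(v) for v in range(n)]

    # Step 2: lLCA(u, v) is the level of the deepest common ancestor.
    def llca(u, v):
        return max(level[w] for w in anc[u] & anc[v])

    def is_tree_edge(u, v):
        return father[u] == v or father[v] == u

    # Step 3: lHLCA(v) is the smallest lLCA(v, w) over non-tree edges (v, w),
    # with level(v) itself as the fallback contributed by the set {v}.
    lhlca = [
        min([level[v]] + [llca(v, w) for w in adj[v] if not is_tree_edge(v, w)])
        for v in range(n)
    ]

    # Step 4: alpha(v) is the minimum lHLCA over v and all its descendants.
    alpha = [min(lhlca[w] for w in range(n) if v in anc[w]) for v in range(n)]

    # Step 5: the tree edge (F(v), v) is a bridge iff level(v) <= alpha(v).
    return sorted(
        tuple(sorted((father[v], v)))
        for v in range(n)
        if v != root and level[v] <= alpha[v]
    )


# A triangle 0-1-2 with a tail 2-3-4: only the two tail edges are bridges.
tail_bridges = bridges(5, [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)])
```

Step 6 (deleting the bridges and finding connected components) is omitted here, since any connectivity routine can be applied to the remaining edges.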


Algorithm: Biconnectivity 
1. Find a directed spanning tree $T(V,E')$ of $G(V,E)$; 
2. Compute $lLCA(u,v)$ $\forall (u,v) \in V \times V$; 
3. Compute $HLCA(v)$ and $lHLCA(v)$ $\forall v \in V$; 
4. Construct an undirected graph $G''(E',E'')$ such that 

$((F(u),u),(F(v),v)) \in E''$ iff $HLCA(u) \prec v \preceq u$ or $HLCA(v) \prec u \preceq v$ 

or $(u,v)$ is in $G-T$ and $u \not\preceq v$, $v \not\preceq u$. 

5. Find the connected components $\{C_i\}$ of $G''$. (* Each of 
them uniquely determines a biconnected component in $G$ and 
vice versa. *) Find the roots $\{spt_i\}$ of the induced trees 
$\{T \cap C_i\}$. (* $\{spt_i\}$ forms the set of separation vertices of 
$G$. $r$ is excluded unless $r = spt_i = spt_j$ for some $i \ne j$ *) 


4. Correctness of the Algorithms 
The correctness is based on the following theorems. 


Lemma 4.1: If $e$ is a bridge in $G$, then $e \in E'$. 


Theorem 4.2: $(F(v),v) \in E'$ is a bridge in $G$ iff $level(v) \le \alpha(v)$. 
Proof: If $(F(v),v)$ is a bridge, then there is no fundamental 
cycle containing $(F(v),v)$ in $G$. Therefore for all descendants $w$ 
of $v$ in $T$, $lHLCA(w) \ge level(v)$. Hence $level(v) \le \alpha(v)$. 

If $(F(v),v) \in E'$ is not a bridge, then there is a fundamental cycle 
$C$ containing $(F(v),v)$. Let $C$ be determined by $(x,y)$ where $(x,y)$ 
is in $G-T$. Without loss of generality, we assume $v \preceq x$. Clearly, 
$\alpha(v) \le lHLCA(x) \le lLCA(x,y) < level(v)$. □ 


Denote $u R v$ iff $((F(u),u),(F(v),v)) \in E''$. It is easy to 
prove the following lemma: 
Lemma 4.3: If $w_1 R w_2, w_2 R w_3, \ldots, w_{\ell-1} R w_\ell$, then $(F(w_1),w_1)$ 
and $(F(w_\ell),w_\ell)$ belong to the same cycle in $G$. 


Theorem 4.4: $(F(u),u)$ and $(F(v),v)$ belong to the same 
connected component in $G''$ iff $(F(u),u)$ and $(F(v),v)$ belong to 
the same biconnected component in $G$. 

Proof: The 'only if' part is obvious due to Lemma 4.3. 

Let $(F(u),u)$ and $(F(v),v)$ belong to the same biconnected 
component in $G$. There exists a simple cycle $C$ containing 
$(F(u),u)$ and $(F(v),v)$. Let $C_1, C_2, \ldots, C_\ell$ be the set of fundamental 
cycles such that $C = \oplus\{C_i\}_{1 \le i \le \ell}$ ($\oplus$ is the mod-two sum). 
Without loss of generality, let $(F(u),u) \in C_1$, $(F(v),v) \in C_\ell$, and 
$(F(w_i),w_i) \in C_i$ and $C_{i+1}$, $1 \le i < \ell$. Let $(a_i, b_i)$ be the edge in 
$G-T$ determining $C_i$, $1 \le i \le \ell$; then in each $C_i$, one of the 
following must hold true: (i) $a_i R b_i$ and ($w_{i-1} R a_i$ or 
$w_{i-1} R b_i$) and ($w_i R a_i$ or $w_i R b_i$); (ii) $a_i R w_{i-1}$ and 
$a_i R w_i$; (iii) $b_i R w_{i-1}$ and $b_i R w_i$. In any of the cases, there 
is a path from $(F(w_{i-1}),w_{i-1})$ to $(F(w_i),w_i)$. In particular, there 
is a path from $(F(u),u)$ to $(F(w_1),w_1)$ in $C_1$ and a path from 
$(F(w_{\ell-1}),w_{\ell-1})$ to $(F(v),v)$ in $C_\ell$. Joining all these paths together, 
we have a path from $(F(u),u)$ to $(F(v),v)$ in $G''$. Hence, $(F(u),u)$ and 
$(F(v),v)$ belong to the same connected component in $G''$. □ 


5. An Implementation Based on Matrix Multiplication 

On the parallel computer models we consider, we assume 
that there exists an ordinary matrix multiplication algorithm 
and that each processor is capable of carrying out any of the 
operations $+, -, \times, \wedge, \vee, \neg, =, \ne, \le, \ge$ in constant time. 1 is used to 
represent the boolean constant 'true' while 0 is used to 
represent 'false'. Each processor contains a constant number 
of registers. Communication between interconnected 
processors and between registers within the same 
processor takes constant time. The undirected graph is 
represented by an adjacency matrix $M$ such that entry $M[i,j]$ 
is stored in $PE[i,j]$. A register, say $A$, in $PE[i,j]$ is denoted 
by $A[i,j]$. Without loss of generality, we assume $G$ is 
connected and $V = \{1, 2, 3, \ldots, n\}$. 


Definition: A function $f$ is called an extended monadic 
function w.r.t. $i$, $j$ if the arguments of $f$ are of the form 
$OP[i,j]$ where $OP$ is either the name of a register or a 
function of $i$, $j$. We denote it by $f[i,j]$. 


Definition: An ordinary matrix multiplication algorithm is 
an algorithm which uses only the associativity of +. 


Lemma 5.0: If an ordinary matrix multiplication algorithm for 
two $n \times n$ matrices exists on a computer model, then the 
following operation could be carried out on the same model 
using the same order of magnitude of time and hardware 
resources. 

    $M[i,j] := f_3[\theta_k \, g(f_1[i,k], f_2[k,j])]$ $\forall i,j$, $1 \le i,j \le n$, 
where $f_1$, $f_2$, $f_3$ are extended monadic functions w.r.t. $i,k$; $k,j$ 
and $i,j$ respectively. $g$ is a composite function of the 
arithmetic and boolean operations mentioned above and $\theta$ is 
an associative operator. 
Proof: Trivial. □ 
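
The operation in Lemma 5.0 can be read as a matrix product in which the multiplication is replaced by g and the summation by an associative operator. The sequential sketch below illustrates that reading; the name `generalized_product` and its argument names are illustrative, and the matrices A and B stand for the already evaluated values of the first two extended monadic functions.

```python
from functools import reduce


def generalized_product(A, B, g, theta, f3=lambda x, i, j: x):
    """Return M with M[i][j] = f3(theta over k of g(A[i][k], B[k][j]), i, j).

    A and B are n x n lists of lists, g combines one entry of A with one
    entry of B, and theta is an associative binary operator such as
    min, max, addition, or boolean or.
    """
    n = len(A)
    return [
        [
            f3(reduce(theta, (g(A[i][k], B[k][j]) for k in range(n))), i, j)
            for j in range(n)
        ]
        for i in range(n)
    ]


INF = float("inf")

# Ordinary product: g is multiplication, theta is addition.
P = generalized_product([[1, 2], [3, 4]], [[5, 6], [7, 8]],
                        g=lambda a, b: a * b, theta=lambda x, y: x + y)

# Min-plus product on the distance matrix of the path 0 - 1 - 2.
D = [[0, 1, INF], [1, 0, 1], [INF, 1, 0]]
D2 = generalized_product(D, D, g=lambda a, b: a + b, theta=min)
```

Only the associativity of the outer operator is used, which is why any parallel schedule that works for the ordinary product carries over unchanged.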


Lemma 5.1: The time and hardware resources needed to 
broadcast the contents of register $M[a,b]$ columnwise 
(rowwise) is at worst the same as that needed by the 
ordinary matrix multiplication algorithm. 
Proof: To broadcast the contents of $M[a,b]$ columnwise, we 
perform 

    $M[w,b] := \sum_{k=1}^{n} ((M[w,k] * 0) + (M[k,b] * (k = a)))$, $1 \le w \le n$. 
Clearly, $M[w,b] = M[a,b]$, $1 \le w \le n$. By Lemma 5.0, the lemma 
follows. Broadcasting rowwise is handled in a similar way. □ 
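
The broadcast in the proof of Lemma 5.1 can be checked directly. The sketch below evaluates the same sum for every row w of one column; it uses 0-based indices and a list-of-lists matrix, both of which are assumptions of this illustration, and the function name is hypothetical.

```python
def broadcast_columnwise(M, a, b):
    """Copy M[a][b] into every entry of column b using the sum of Lemma 5.1."""
    n = len(M)
    out = [row[:] for row in M]
    for w in range(n):
        # The first term is always 0; the second term is M[k][b] only when k == a,
        # so the sum over k collapses to M[a][b].
        out[w][b] = sum(M[w][k] * 0 + M[k][b] * (k == a) for k in range(n))
    return out


M = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
B = broadcast_columnwise(M, a=1, b=2)   # column 2 now holds M[1][2] = 6 everywhere
```

The term multiplied by 0 contributes nothing to the value; it is kept so that the expression has the two-argument shape required by Lemma 5.0.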


Lemma 5.2: Suppose there exists an ordinary matrix 
multiplication algorithm on a computer model. Then the 
all-pair shortest paths of an undirected graph $G$ with diameter 
$d$ can be determined in $O(t'(n))$ time with $H'(n)$ processors on 
that model, 
where $t'(n) = t(n) \cdot (\lceil \log d \rceil + 1)$ and $H'(n) = H(n)$. 
Proof: Construct matrix $D$ such that 
    $D[i,j] = 1$ if $M[i,j] = 1$ and $i \ne j$; 
    $D[i,j] = 0$ if $i = j$; 
    $D[i,j] = +\infty$ if $M[i,j] = 0$ and $i \ne j$. 
Compute the matrix $D^*$ as follows: 
    $D^0 = D$, 
    $D^i[u,v] = \min_w (D^{i-1}[u,w] + D^{i-1}[w,v])$, $i \ge 1$. 
Clearly, $D^i[u,v]$ contains the length of the shortest path from $u$ to $v$ 
containing no more than $2^i$ edges. Therefore after $\lceil \log d \rceil$ 
iterations, $D^i[u,v]$ contains the shortest distance from $u$ to $v$ in 
$G$. One more iteration is required to verify that $D^*$ has been 
computed. By Lemma 5.0, $t'(n) = t(n) \cdot (\lceil \log d \rceil + 1)$ and $H'(n) = H(n)$. □ 
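
The iteration count in Lemma 5.2 can be observed with a sequential sketch of the repeated min-plus squaring. The adjacency matrix is assumed to be a 0/1 list of lists with 0-based vertices, and the function names are illustrative; the loop stops at the first squaring that changes nothing, which is the verification step mentioned in the proof.

```python
INF = float("inf")


def min_plus_square(D):
    """One squaring step: entry [u][v] is min over w of D[u][w] + D[w][v]."""
    n = len(D)
    return [
        [min(D[u][w] + D[w][v] for w in range(n)) for v in range(n)]
        for u in range(n)
    ]


def all_pairs_shortest_paths(M):
    """Return (D*, number of squarings performed) for a 0/1 adjacency matrix M."""
    n = len(M)
    D = [
        [0 if i == j else (1 if M[i][j] else INF) for j in range(n)]
        for i in range(n)
    ]
    squarings = 0
    while True:
        nxt = min_plus_square(D)
        squarings += 1
        if nxt == D:          # the extra squaring that confirms D* was reached
            return D, squarings
        D = nxt


# Path 0 - 1 - 2 - 3 - 4 has diameter d = 4, so ceil(log2 d) + 1 = 3 squarings.
path = [[1 if abs(i - j) == 1 else 0 for j in range(5)] for i in range(5)]
dist, count = all_pairs_shortest_paths(path)
```

For the five-vertex path the sketch performs three squarings, matching the bound of the lemma for d = 4.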


Lemma 5.3: Computing the transitive closure $M^*$ takes $O(t'(n))$ 
time with $H(n)$ hardware resources. 
Proof: $M^*[a,b] = 1$ iff $D^*[a,b] \ne +\infty$. □ 


A detailed implementation of Algorithm 
Bridge-connectivity based on matrix multiplication. 
(1). Given the $n \times n$ adjacency matrix $M$ of $G(V,E)$ where $M[i,j]$ is 
     stored in processor $PE[i,j]$, $1 \le i,j \le n$. (* We shall construct 
     a directed breadth-first search spanning tree for $G$ *) 

  1.1 Compute the all-pair shortest path matrix $D^*$. 

  1.2 Choose a vertex $r$ (say 1) as the root. 
      $level[r,v] := D^*[r,v] + 1$. (* $level(v) = level[r,v]$ *) 
      (Broadcast columnwise) $level[k,v] := level[r,v]$ $\forall v \in V$. 
      (Broadcast rowwise) $level[v,k] := level[v,v]$ $\forall v \in V$. 
      (* At this point, $level(v) = level[v,v]$ *) 
  1.3 Compute matrix $F$: 
      $F'[v,j] := \bigvee_k (level[v,k] = ((1 + level[k,j]) * (k = j)))$. 
      (* $F'[v,j] = 1$ iff $level(v) = level(j) + 1$ *) 
      $F[v,j] := \max_k ((F'[v,k] \wedge M[v,k]) * k + M[k,j] * 0)$. 
      (* $F(v) = F[v,v]$, for $v \ne r$, is chosen as the father of $v$ *) 
  1.4 (* Construct the adjacency matrix $T$ for the BFS 
      spanning tree, and $T'$, the transpose of $T$ *) 
      (Broadcast columnwise:) $T[k,v] := F[v,v]$ $\forall v \in V$. 
      (Broadcast rowwise:) $T'[v,k] := F[v,v]$ $\forall v \in V$. 
      $T[w,v] := (w = T[w,v])$ $\forall v,w \in V$. 
      $T'[v,w] := (w = T'[v,w])$ $\forall v,w \in V$. 
(2). 2.1 Compute the transitive closure $T^*$ and $(T')^*$ of $T$ and $T'$. 
      (* Note that $T^*[v,v] = 1$ and $(T')^*[v,v] = 1$ $\forall v \in V$ *) 
  2.2 $lLCA[i,j] := \max_k \{((T')^*[i,k] \wedge T^*[k,j]) * level[k,j]\}$. 
      (* $lLCA[v,v] = level(v)$ $\forall v \in V$ *) 
(3). 3.1 Construct the adjacency matrix $M'$ of $G-T$. 
      $M'[i,j] := M[i,j] \wedge \neg(T[i,j] \vee T'[i,j])$. (* $M'[v,v] = 1$ $\forall v \in V$ *) 
  3.2 $lHLCA[i,j] := \min_k (\{lLCA[i,k] * M'[k,j]\} - \{0\})$. 
      (* $lHLCA[v,v] = lHLCA(v)$ and $lHLCA[v,v] \le level(v)$ $\forall v \in V$ *) 
(4). 4.1 (Broadcast rowwise:) $lHLCA[v,w] := lHLCA[v,v]$ $\forall w \in V$. 
  4.2 $\alpha[i,j] := \min_k (\{T^*[i,k] * lHLCA[k,j]\} - \{0\})$. (* Note: $\alpha[v,v] = \alpha(v)$ *) 
(5). 5.1 $Bridge[u,v] := \bigvee_k (T[u,k] \wedge (\alpha[k,v] \ge level[k,v]) \wedge (k = v))$. 
(6). 6.1 Compute the matrix $M''$: $M''[i,j] := M[i,j] \wedge \neg Bridge[i,j]$. 
  6.2 Compute $(M'')^*$. □ 


The method used in step 1 to find a directed BFS 
spanning tree was first 'implicitly' given by Savage [11]. It also 
appeared in [1,3]. The resource complexities are as follows: 


Step    time        hardware resources    justification 

1.1     O(t'(n))    H(n)                  Lemma 5.2 
1.2     O(t(n))     H(n)                  Lemma 5.1 
1.3     O(t(n))     H(n)                  Lemma 5.0 
1.4     O(t(n))     H(n)                  Lemma 5.1 
2.1     O(t'(n))    H(n)                  Lemma 5.3 
2.2     O(t(n))     H(n)                  Lemma 5.0 
3.1     O(1)        H(n)                  trivial 
3.2     O(t(n))     H(n)                  Lemma 5.0 
4.1     O(t(n))     H(n)                  Lemma 5.1 
4.2     O(t(n))     H(n)                  Lemma 5.0 
5.1     O(t(n))     H(n)                  Lemma 5.0 
6.1     O(1)        H(n)                  trivial 
6.2     O(t'(n))    H(n)                  Lemma 5.3 

Theorem 5.4: Algorithm Bridge-connectivity takes 
$O(t(n) \cdot (\lceil \log d \rceil + 1))$ time with $H(n)$ hardware resources. 


A detailed implementation of Algorithm Biconnectivity 
based on matrix multiplication. 
Steps (1)-(3). Same as Steps (1)-(3) of Algorithm 
Bridge-connectivity. (* After step 3, $level[v,k] = level(v)$, 
$lHLCA[v,v] = lHLCA(v)$, $\forall v,k \in V$ *) 

(4). 4.1 (Broadcast rowwise:) $lHLCA[v,k] := lHLCA[v,v]$, $\forall v \in V$. 
      (Broadcast columnwise:) $level'[k,v] := level[v,v]$, $\forall v \in V$. 
      (Broadcast columnwise:) $lHLCA'[k,v] := lHLCA[v,v]$, $\forall v \in V$. 

  4.2 Construct an adjacency matrix $M''$ for $G''(E',E'')$. 

      $M''[u,v] := (T')^*[u,v] \wedge (level'[u,v] > lHLCA[u,v])$. 
      $M''[u,v] := M''[u,v] \vee (T^*[u,v] \wedge (level[u,v] > lHLCA'[u,v]))$. 
      $M''[u,v] := M''[u,v] \vee M'[u,v]$. 

      (* Since each $v$ uniquely determines $F(v)$, we use $v$ to 
      represent $(F(v),v)$ in the vertex set of $G''$ *) 

(5). 5.1 Compute $(M'')^*$. 

  5.2 (* $F(v) = F[v,v]$, $\forall v \in V$ after step 1.4 *) 

      (Broadcast columnwise:) $F[k,v] := F[v,v]$, $\forall v \in V$. 
      Compute the matrix $subroot$. $\forall v \ne r$: 

      $subroot[v,j] := \min_k (\{((M'')^*[v,k] * F[v,k]) + (F[k,j] * 0)\} - \{0\})$. 

      (* $subroot[v,k]$, $\forall k \in V$, contains the root of the 
      subtree (of the BFS spanning tree) containing $v$ *) 

  5.3 $\forall v \ne r$: $C[v,w] := \min_k (\{((M'')^*[v,k] * k) + ((M'')^*[k,w] * 0)\} - \{0\})$. 

      $BC[v,w] := ((M'')^*[v,w] \wedge (w = C[v,w]))$. 

  5.4 $\forall v \in V$ such that $v = C[v,w]$ and $v \ne r$: 

      (Broadcast columnwise:) $subroot[k,v] := subroot[v,v]$; 
      $BC[w,v] := BC[w,v] \vee (w = subroot[w,v])$; 

      (* $BC[u,v] = BC[w,v] = 1$ iff $u$ and $w$ belong to the same 
      biconnected component represented by $v$ in $G$ *) 

      (* $Flag[w,v] = 0$ $\forall w,v \in V$ initially *) 
      $Flag[w,v] := (w = subroot[w,v])$. 

  5.5 $Flagsum[u,v] := \sum_k (Flag[u,k] + (Flag[k,v] * 0))$. 
      $Spt[v,v] := (Flagsum[v,v] > 0)$ if $v \ne r$; 
      $Spt[v,v] := (Flagsum[v,v] > 1)$ if $v = r$. 

      (* $Spt[v,v] = 1$ iff $v$ is a separation vertex *) □ 


The correctness is easily verified. The resource 
complexities are as follows: 


Step          time                hardware resources    justification 
(1) to (3)    O(t'(n))            H(n) 
4.1           O(t(n))             H(n)                  Lemma 5.1 
4.2           O(1)                H(n)                  trivial 
5.1           O(t(n)*log d'')     H(n)                  Lemma 5.3 
5.2           O(t(n))             H(n)                  Lemmas 5.0, 5.1 
5.3           O(t(n))             H(n)                  Lemma 5.0 
5.4           O(t(n))             H(n)                  Lemmas 5.0, 5.1 
5.5           O(t(n))             H(n)                  Lemma 5.0 
(* $d''$ is the diameter of $G''$, $1 \le d'' \le n$. *) 
Theorem 5.5: Algorithm Biconnectivity takes 
$O(t(n) \cdot (\max(\log d, \log d'') + 1))$ time and $H(n)$ hardware resources. 


6. Implementation on Existing Models 

The models we consider are the following: 

MCN(VLSI)  : Mesh Connected Networks [1,3]. 

PSN(VLSI)  : Perfect Shuffle Networks [3,14]. 

CCC(VLSI)  : Cube Connected Cycles [9]. 

OTN(VLSI)  : Orthogonal Tree Networks [8]. 

OTC(VLSI)  : Orthogonal Tree Cycles [8]. 

SIMD-CCC   : SIMD Cube Connected Computers [3]. 

SIMD-SM-R  : SIMD Shared Memory model with read 
conflicts permitted [11,12]. This includes the P-RAM [18]. 

SIMD-SM-RW : SIMD Shared Memory model with read and 
write conflicts permitted [5,13]. 


From the references cited above and Theorems 5.4 and 
5.5, it is not difficult to verify the results stated below: 


algorithms on various existing models. 


model         time                   chip area      AT^2                          # of processors    ref. 
MCN(VLSI)     O(n)                   n^?            O(n^?)                        ---                [?] 
PSN(VLSI)     O(log^? n loglog n)    n^?/log n      O(n^? log^? n log^2 log n)    ---                [7] 
CCC(VLSI)     O(log^? n)             n^?/log n      O(n^? log^? n)                ---                [?]* 
OTN(VLSI)     O(log^? n loglog n)    n^? log^? n    O(n^? log^? n log^? log n)    ---                [8]* 
OTC(VLSI)     O(log^3 n)             n^?            O(n^3 log^? n)                ---                [?] 
SIMD-CCC      O(log^? n)             ---            ---                           n⌈n/log n⌉         [3]* 
SIMD-SM-R     O(log^? n)             ---            ---                           ⌈n^3/log n⌉        [12] 
   or         O(log^? n)             ---            ---                           mn+?               [18] 
SIMD-SM-RW    O(log n)               ---            ---                           n^?                [5]* 
   or         O(log n)               ---            ---                           m+2nm              [13]* 


(* Note: m is the number of edges in the graph; * indicates 
that the simple-minded algorithm based on the 
graph-connectivity algorithm or the matrix multiplication 
algorithm applies *) 


algorithms on various existing models. 


model         time            chip area       AT^2                    # of processors    ref. 
MCN(VLSI)     O(n)            n^?             O(n^?)                  ---                [19] 
PSN(VLSI)     O(log n * L)    n^?/log^3 n     O(L^2 * n^?/log n)      ---                [3] 
CCC(VLSI)     O(log n * L)    n^?/log^2 n     O(n^? * L^2)            ---                [9] 
OTN(VLSI)     O(log n * L)    n^? log^2 n     O(n^? log^? n * L^2)    ---                [7] 
OTC(VLSI)     O(log n * L)    n^?             O(n^? log^2 n * L^2)    ---                [7] 
SIMD-CCC      O(log n * L)    ---             ---                     ⌈n^3/log n⌉        [3] 
SIMD-SM-R     O(log n * L)    ---             ---                     ⌈n^?/log n⌉        [11] 
SIMD-SM-RW    O(L)            ---             ---                     n^?                [5] 
(* $L = \lceil \log d \rceil + 1$ for bridge-connectivity; 
$L = \max(\log d, \log d'') + 1$ for biconnectivity. $1 \le d, d'' \le n$. *) 
The efficiency of our algorithms should be evident. 


7. Conclusion 

In Section 1, we indicated that the time and hardware 
resource bounds of our algorithms are bounded above by 
$O(t(n) \cdot L)$ and $H(n)$ respectively. This means that our algorithms 
have the potential of achieving other good complexity 
bounds if more elaborate techniques are used. As a matter 
of fact, it has been shown that: (i) for the SIMD-SM-R 
model, our algorithms could achieve the $O(\log^2 n)$ time bound 
using only $n\lceil n/\log^2 n \rceil$ processors. This result is optimal for 
dense graphs [15]; (ii) for the conventional sequential 
computers, our algorithms could run in optimal time and 
space [16]. It could be shown that using Reif's recent result 
on the minimum spanning forest [10], our algorithms could 
run in $O(\log n)$ time with probabilistic error $\epsilon$, $0 < \epsilon < 1$, using 
lElmlogn processors. Similarly, using Awerbuch and 
Shiloach's recent result on the minimum spanning forest [20], 
our algorithms could run in $O(\log n)$ time with [n/log/] 
processors on SIMD-SM-RW. 


The strategy we use in this paper, when applied to 
Savage and Ja'Ja's algorithms [12] on any of the computer 
models other than the SIMD-SM-R, does not result in 
algorithms better than the simple-minded algorithm. 
However, we do observe that Atallah and Kosaraju's 
algorithms [1] could adopt our strategy and achieve the same 
hardware resource bound and a slightly higher time bound than 
ours on all the computer models discussed in the preceding 
section. However, their algorithms are inferior to ours for 
the following reasons: Besides being more complicated, their 
algorithms rely on the algorithm for finding the transitive 
closure of a 'directed' graph and thus have the same time and 
resource bounds as that algorithm. As the resource bounds 
of the transitive closure problem are difficult to improve, so 
are the bounds of their algorithms. In fact, we do not think 
that their algorithms can be modified to achieve the optimal 
bounds we have achieved on the SIMD-SM-R and sequential 
models. Perhaps even more importantly, their algorithms do 
not lead to $O(\log n)$ time probabilistic algorithms for the 
P-RAM as ours do. 
Anomalies In Parallel Branch-and-Bound Algorithms* 


Ten-Hwang Lai 
The Ohio State University 

and Sartaj Sahni 


* This research was supported in part by the Office of 
Naval Research under contract N00014-80-C-0650. 


Abstract 


We consider the effects of parallelizing branch-and-bound 
algorithms by expanding several live nodes simultaneously. 
It is shown that it is quite possible for a 
parallel branch-and-bound algorithm using $n_2$ processors 
to take more time than one using $n_1$ processors 
even though $n_1 < n_2$. Furthermore, it is also possible to 
achieve speedups that are in excess of the ratio $n_2/n_1$. 
Experimental results with the 0/1 Knapsack and Traveling 
Salesperson problems are also presented. 


Key Words and Phrases 


Parallel algorithms, branch-and-bound, anomalous behavior. 


1. Introduction 


Branch-and-bound is a popular algorithm design technique 
that has been successfully used in the solution of 
problems that arise in various fields (e.g., combinatorial 
optimization, artificial intelligence, etc.) [1, 6-12]. We 
shall briefly describe the branch-and-bound method as 
used in the solution of combinatorial optimization problems. 
Our terminology is from Horowitz and Sahni [7]. 


In a combinatorial optimization problem we are 
required to find a vector $x = (x_1, x_2, \ldots, x_n)$ that optimizes 
some criterion function $f(x)$ subject to a set $C$ of 
constraints. This constraint set may be partitioned into 
two subsets: explicit and implicit constraints. Implicit 
constraints specify how the $x_i$s must relate to each 
other. Two examples are: 

1) $\sum_{i=1}^{n} a_i x_i \le b$ 

2) $a_1 x_1^2 - a_2 x_1 x_2 + a_3 x_2^4 = 6$ 


Explicit constraints specify the range of values 
each $x_i$ can take. For example: 

1) $x_i \in \{0, 1\}$ 
2) $x_i \ge 0$ 


The set of vectors that satisfy the explicit constraints 
defines the solution space. In a branch-and-bound 
approach this solution space is organized as a 
graph which is usually a tree. This resulting organization 
is called a state space graph (tree). All the state 
space graphs used in this paper are trees. So we shall 
henceforth only refer to state space trees. Figure 1 
shows a state space tree for the case $n = 3$ and $x_i \in \{0, 1\}$. 
The path from the root to some of the nodes (in this 
case the leaves) defines an element of the solution 
space. Nodes with this property are called solution 
nodes. Solution nodes that satisfy the implicit constraints 
are called feasible solution nodes or answer 
nodes. Answer nodes have been drawn as double circles 
in Figure 1. The cost of an answer node is the value of 
the criterion function at that node. In solving a combinatorial 
optimization problem we wish to find a least 
cost answer node. 


Figure 1: A state space tree 


For convenience we assume that we wish to minimize 
$f(x)$. With every node N in the state space tree, we 
associate a value $f_{\min}(N) = \min\{f(Q) : Q$ is a feasible 
solution node in the subtree N$\}$. (If there exists no such 
Q, then let $f_{\min}(N) = \infty$.) 


While there are several types of branch-and-bound 
algorithms, we shall be concerned only with the more 
popular least cost branch-and-bound (lcbb). In this 
method a heuristic function g( ) with the following 
properties is used: 


(P1) $g(N) \le f_{\min}(N)$ for every node N in the state space 
tree. 


(P2) $g(N) = f(N)$ for solution nodes representing feasible 
solutions (i.e., answer nodes). 


(P3) $g(N) = \infty$ for solution nodes representing infeasible 
solutions. 


(P4) $g(N) \ge g(P)$ if N is a child of P. 


g( ) is called a bounding function. lcbb generates 
the nodes in a state space tree using g( ). A node that 
has been generated, can lead to a feasible solution, 
and whose children haven't yet been generated is called 
a live node. A list of live nodes (generally as a heap) is 
maintained. In each iteration of the lcbb a live node, N, 
with least g( ) value is selected. This node is called the 
current E-node. If N is an answer node, it must be a 
least cost answer node. If N is not an answer node, its 
children are generated. Children that cannot lead to a 
least cost answer node (as determined by some heuristic) 
are discarded. The remaining children are added to 
the list of live nodes. 
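
The lcbb loop just described can be sketched on an explicitly given state space tree. The dictionary representation, the function name `lcbb`, and the small example tree are all hypothetical; the only discarding heuristic modeled is dropping children whose g( ) value is infinite.

```python
import heapq

INF = float("inf")


def lcbb(root, children, g, answers):
    """Sequential least cost branch-and-bound on an explicit tree.

    children maps a node to its list of children, g maps a node to its
    bound, and answers is the set of answer nodes. Returns the least cost
    answer node found and the list of E-nodes that were expanded.
    """
    live = [(g[root], root)]                 # heap of live nodes keyed by g
    expanded = []
    while live:
        _, node = heapq.heappop(live)        # current E-node: least g value
        if node in answers:
            return node, expanded            # first answer selected is least cost
        expanded.append(node)
        for child in children.get(node, []):
            if g[child] != INF:              # discard children that cannot lead to an answer
                heapq.heappush(live, (g[child], child))
    return None, expanded


# A tiny tree whose g values are nondecreasing along every path (P4).
children = {"r": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
g = {"r": 1, "a": 2, "b": 3, "a1": 6, "a2": INF, "b1": 4, "b2": 5}
answers = {"a1", "b1", "b2"}
best, order = lcbb("r", children, g, answers)
```

In this example the answer node with value 4 is returned after three expansions, even though an answer with value 6 was generated earlier; the heap defers it because its g( ) value is larger.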


The problem of parallelizing lcbb has been studied 
earlier [2 - 5, 13]. There are essentially three ways to 
introduce parallelism into lcbb: 


(1) Expand more than 1 E-node during each iteration. 


(2) Evaluate g( ) and determine feasibility in parallel. 


(3) Use parallelism in the selection of the next 
E-node(s). 


Wah and Ma [13] exclusively consider (1) above 
(though they point out (2) and (3) as possible sources of 
parallelism). If p processors are available then q = 
min{p, number of live nodes} live nodes are selected as 
the next set of E-nodes (these are the q live nodes with 
smallest g( ) values). Let $g_{\min}$ be the least g value 
among these q nodes. If any of these E-nodes is an 
answer node and has g( ) value equal to $g_{\min}$, then a least 
cost answer node has been found. Otherwise all q 
E-nodes are expanded and their children added to the list 
of live nodes. Each such expansion of q E-nodes counts 
as one iteration of the parallel lcbb. For any given problem 
instance and g, let I(p) denote the number of iterations 
needed when p processors are available. Intuition 
suggests that the following might be true about I(p): 

(I1) $I(n_1) \ge I(n_2)$ whenever $n_1 < n_2$ 


(I2) $\frac{I(n_1)}{I(n_2)} \le \frac{n_2}{n_1}$ 


In Section 2, we show that neither of these two relations 
is in fact valid. Even if the g( )s are restricted 
beyond (P1) - (P4), these relations do not hold. The 
experimental results provided in Section 3 do, however, 
show that (I1) and (I2) can be expected to hold "most" 
of the time. 


Wah and Ma [13] experimented with the vertex 
cover problem using $2^k$, $0 \le k \le 6$, processors. Their 
results indicate that $I(1)/I(p) \approx p$. Our experiments 
with the 0/1-Knapsack and Traveling Salesperson problems 
indicate that $I(1)/I(p) \approx p$ only for "small" values 
of p (say $p \le 16$). 


2. Some Theorems For Parallel Branch-and-Bound 


As remarked in the introduction, several anomalies 
occur when one parallelizes branch-and-bound algorithms 
by using several E-nodes at each iteration. In 
this section we establish these anomalies under varying 
constraints for the bounding function g( ). First, it 
should be recalled that the g( ) functions typically used 
(e.g., for the knapsack problem, traveling salesperson 
problem, etc., cf. [7]) have the following properties: 


(a) $g(N) \ge g(M)$ whenever N is a child of node M. Thus, 
the g( ) values along any path from the root to a 
leaf form a nondecreasing sequence. 


(b) Several nodes in the state space tree may have the 
same g( ) value. In fact, many nonsolution nodes 
may have a g( ) value equal to f*. This is particularly 
true of nodes that are near ancestors of solution 
nodes. 


In constructing example state space trees, we shall 
keep (a) in mind. None of the trees constructed will 
violate (a) and we shall not explicitly make this point in 
further discussion. The first result we shall establish is 
that it is quite possible for a parallel branch-and-bound 
using $n_2$ processors to perform much worse than one 
using a fewer number $n_1$ of processors. 


Theorem 1: Let $n_1 < n_2$. For any $k > 0$, there exists a 
problem instance such that $k I(n_1) < I(n_2)$. 
Proof: Consider a problem instance with the state 
space tree of Figure 2. All nonleaf nodes have the same 
g( ) value equal to f*, the f value of the least cost answer 
node (node A). When $n_1$ processors are available, one 
processor expands the root and generates its $n_1 + 1$ 
children. Let us suppose that on iteration 2, the left $n_1$ 
nodes on level 2 get expanded. Of the $n_1$ children generated, 
$n_1 - 1$ get bounded and only one remains live. 
On iteration 3 the remaining live node on level 2 (B) and 
the one on level 3 are expanded. The level 3 node leads 
to the solution node and the algorithm terminates with 
$I(n_1) = 3$. 


Figure 2: Instance for Theorem 1 


When $n_2$ processors are available, the root is 
expanded on iteration 1 and all $n_1 + 1$ live nodes from 
level 2 get expanded on iteration 2. The result is $n_2 + 1$ 
live nodes on level 3. Of these, only $n_2$ can be expanded 
on iteration 3. These $n_2$ could well be the rightmost $n_2$ 
nodes. And iterations 4, 5, ..., 3k could very well be limited 
to the rightmost subtree of the root. Finally in 
iteration 3k + 1, the least cost answer node A is generated. 
Hence, $I(n_2) = 3k + 1$ and $k I(n_1) < I(n_2)$. □ 
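
The anomaly of Theorem 1 can be reproduced with a small simulation for the case of one processor versus two. The tree below is a simplified variant of the construction, not the tree of Figure 2: every node has the same g( ) value, and a hypothetical `rank` field stands in for the arbitrary order in which ties among equal g( ) values happen to be resolved. Iterations are counted as expansion rounds, as in the proof above.

```python
import heapq


def iterations(root, children, g, rank, answers, p):
    """Number of expansion rounds of the parallel lcbb with p processors.

    Ties among equal g values are broken by rank (smaller first), which
    models one arbitrary but fixed tie-breaking order.
    """
    live = [(g[root], rank[root], root)]
    count = 0
    while live:
        q = min(p, len(live))
        enodes = [heapq.heappop(live) for _ in range(q)]
        g_min = enodes[0][0]
        if any(node in answers and gv == g_min for gv, _, node in enodes):
            return count                      # a least cost answer node was selected
        count += 1                            # one expansion of q E-nodes
        for _, _, node in enodes:
            for child in children.get(node, []):
                heapq.heappush(live, (g[child], rank[child], child))
    return count


def build_instance(k):
    """Variant of the Theorem 1 instance for 1 versus 2 processors."""
    children = {"root": ["X", "B"], "X": ["Y"], "Y": ["A"], "B": ["b1_0", "b2_0"]}
    rank = {"root": 0, "X": 1, "B": 10, "Y": 5, "A": 0}
    depth = 3 * k - 2                         # length of the two distracting chains
    for branch in ("b1", "b2"):
        for lvl in range(depth):
            name = f"{branch}_{lvl}"
            rank[name] = 2                    # looks more promising than Y (rank 5)
            if lvl + 1 < depth:
                children[name] = [f"{branch}_{lvl + 1}"]
    g = {node: 0 for node in rank}            # every node has g = f* = 0
    return children, g, rank, {"A"}


children, g, rank, answers = build_instance(k=2)
one = iterations("root", children, g, rank, answers, p=1)
two = iterations("root", children, g, rank, answers, p=2)
```

With k = 2 the single processor finishes in 3 iterations, while two processors are drawn into the subtree under B and need 3k + 1 = 7 iterations, so the inequality of Theorem 1 holds for this instance.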


In the above construction, all nodes have the same 
g( ) value, f*. While this might seem extreme, property 
(b) above states that it is not unusual for real g-functions 
to have a value f* at many nodes. The example 
of Figure 2 does serve to illustrate why the use of 
additional processors may not always be rewarding. The 
use of an additional processor can lead to the development 
of a node N (such as node B of Figure 2) that looks 
"promising" and eventually diverts all or a significant 
number of the processors into its subtree. When a 
fewer number of processors are used, the upper bound 
U at the time this "promising" node is to get expanded 
might be such that $U \le g(N)$ and so N is not expanded 
when a fewer number of processors are available. 


The proof of Theorem 1 hinges on the fact that g(N) 
may equal f* for many nodes (independent of whether 
these nodes are least cost answer nodes or not). If we 
require the use of g-functions that can have the value f* 
only for least cost answer nodes, then Theorem 1 is no 
longer valid for all combinations of $n_1$ and $n_2$, $n_1 < n_2$. 
In particular, if $n_1 = 1$ then the use of more processors 
never increases the number of iterations (Theorem 2). 

Definition: A node N is critical iff $g(N) < f^*$. 


Theorem 2: If $g(N) \ne f^*$ whenever N is not a least cost 
answer node, then $I(1) \ge I(n)$ for $n > 1$. 


Proof: When the number of processors is 1, only critical 
nodes and least cost answer nodes can become E-nodes 
(as whenever an E-node is to be selected there is at 
least one node N with $g(N) \le f^*$ in the list of live nodes). 
Furthermore, every critical node becomes an E-node by 
the time the branch-and-bound algorithm terminates. 
Hence, if the number of critical nodes is m, $I(1) = m$. 


When $n > 1$ processors are available, some noncritical 
nodes may become E-nodes. However, at each iteration, 
at least one of the E-nodes must be a critical node. 
So, $I(n) \le m$. Hence, $I(1) \ge I(n)$. □ 


When $n_1 \ne 1$, a degradation in performance is possible 
with $n_2 > n_1$ even if we restrict the g( )s as in 
Theorem 2. 


Theorem 3: Assume that $g(N) \ne f^*$ whenever N is not a 
least cost answer node. Let $1 < n_1 < n_2$ and $k > 0$. 
There exists a problem instance such that $I(n_1) + k \le I(n_2)$. 


Proof: Figures 3(a) and 3(b) show two identical subtrees 
T. Assume that all nodes have the same g( ) value 
and are critical. The numbers inside each node give the 
iteration number in which that node becomes an E-node 
when $n_1$ processors are used (Figure 3(a)) and when $n_2$ 
processors are used (Figure 3(b)). Other evaluation 
orders are possible. However, the ones shown in Figures 
3(a) and 3(b) will lead to a proof of this theorem. 


We can construct a larger state space tree by connecting 
together k copies of T (Figure 3(c)). The B node 
of one copy connects to the A node (root) of the next. 
Each triangle in this figure represents a copy of T. The 
least cost answer node is the child of the B node of the 
last copy of T. It is clear that for the state space tree of 
Figure 3(c), $I(n_1) = jk$ while $I(n_2) = (j + 1)k$. Hence, 
$I(n_1) + k = I(n_2)$. □ 


The assumption that $g(N) \ne f^*$ when N is not a least 
cost answer node is not too unrealistic as it is often possible 
to modify typical g( )s so that they satisfy this 
requirement. The example of Figure 3 has many nodes 
with the same g( ) value and so we might wonder what 
would happen if we restricted the g( )s so that only least 
cost answer nodes can have the same g( ) value. This 
restriction on g( ) is quite severe and, in practice, it is 
often not possible to guarantee that the g( ) in use 
satisfies this restriction. However, despite the severity 
of the restriction one cannot guarantee that there will 
be no degradation of performance using $n_2$ processors 
when $n_1 < n_2 < 2(n_1 - 1)$. We have unfortunately been 
unable to extend our result of Theorem 4 to the case 
when $n_2 \ge 2(n_1 - 1)$. So, it is quite possible that no 
degradation is possible when the number of processors 
is (approximately) doubled and g( ) is restricted as 
above. 


Theorem 4: Let $n_1 < n_2 < 2(n_1 - 1)$ and let $k > 0$. There 
exists a g( ) and a problem instance that satisfy the 
following properties: 


(a) $g(N_1) \ne g(N_2)$ unless both of $N_1$ and $N_2$ are least 
cost answer nodes. 


(b) $I(n_1) + k \le I(n_2)$. 


Proof: Consider the state space tree of Figure 4(a). The 
number outside each node is its g( ) value while the 
number inside a node gives the iteration in which that 
node is the E-node when $n_1$ processors are used. It 
takes $n_1$ processors 4 iterations to get to and evaluate 
node B. When $n_2$ processors are available, $n_1 < n_2 < 
2(n_1 - 1)$, the iteration numbers are as given in Figure 
4(b). This time 5 iterations are needed. Combining k 
copies of this tree and setting the g( ) values in each 
copy to be different from those in other copies yields 
the tree of Figure 4(c). For this tree, we see that $I(n_1) 
+ k = I(n_2)$. □ 

Figure 4: Instance for Theorem 4 


The remaining results we shall establish in this section
are concerned with the maximum improvement in
performance one can get in going from $n_1$ to $n_2$
processors, $n_1 < n_2$. Generally, one would expect that the
performance can increase by at most a factor of $n_2/n_1$.
This is not true for branch-and-bound. In fact, Theorem 5
shows that using g( )s that satisfy properties (a) and (b), an
unbounded improvement in performance is possible.
The reason for this is much the same as for the possibility
of an unbounded loss in performance. The additional
processors might enable us to improve the upper bound
quickly, thereby curtailing the expansion of some of the
nodes that might get expanded without these processors.


Theorem 5: Let $n_1 < n_2$. For any $k > n_2/n_1$, there
exists a problem instance for which
$I(n_1)/I(n_2) \ge k > n_2/n_1$.

Proof: See [14]. □


As in the case of Theorem 2, we can show that when
$g(N) \ne f^*$ whenever N is not a least cost answer node,
$I(1)/I(n) \le n$.


Theorem 6: Assume that $g(N) \ne f^*$ whenever N is not a
least cost answer node. Then $I(1)/I(n) \le n$ for $n > 1$.

Proof: From the proof of Theorem 2, we know that
$I(1) = m$, where m is the number of critical nodes. Since all
critical nodes must become E-nodes before the
branch-and-bound algorithm can terminate, and at most n nodes
become E-nodes in each iteration, $I(n) \ge m/n$. Hence,
$I(1)/I(n) \le n$. □
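The counting argument in this proof can be checked numerically. The sketch below assumes only what the proof states: I(1) = m, and at most n critical nodes can become E-nodes per iteration, so I(n) is at least the ceiling of m/n. The function name `min_iterations` is hypothetical.

```python
import math

def min_iterations(m, n):
    """Fewest iterations in which m critical nodes can all become
    E-nodes when at most n nodes are expanded per iteration."""
    return math.ceil(m / n)

# I(1) = m, so the ratio I(1)/I(n) can never exceed n.
for m in range(1, 200):
    for n in range(2, 20):
        assert m / min_iterations(m, n) <= n
```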


When $1 < n_1 < n_2$ and g(N) is restricted as above,
$I(n_1)/I(n_2)$ can exceed $n_2/n_1$ but cannot exceed $n_2$.


Theorem 7: Assume that $g(N) \ne f^*$ whenever N is not a
least cost answer node. Let $1 < n_1 < n_2$. The following
are true:

(1) $I(n_1)/I(n_2) \le n_2$.

(2) There exists a problem instance for which
$I(n_1)/I(n_2) > n_2/n_1$.

Proof: (1) From Theorems 2 and 6, we immediately
obtain:

$$\frac{I(n_1)}{I(n_2)} = \frac{I(n_1)}{I(1)} \cdot \frac{I(1)}{I(n_2)} \le n_2.$$

(2) See [14]. □


In order to determine the frequency of the anomalous
behavior described in the previous section, we simulated
a parallel branch-and-bound with $2^k$ processors
for k = 0, 1, 2, ..., 9. Two test problems were used:
0/1-Knapsack and Traveling Salesperson. These are
described below.


0/1-Knapsack:


In this problem we are given n objects and a knapsack
with capacity M. Object i has associated with it a profit
$p_i$ and a weight $w_i$. We wish to place a subset of the n
objects into the knapsack such that the knapsack capacity
is not exceeded and the sum of the profits of the
objects in the knapsack is maximum. Formally, we wish
to solve the following problem:

$$\text{maximize } \sum_{i=1}^{n} p_i x_i$$

$$\text{subject to } \sum_{i=1}^{n} w_i x_i \le M, \quad x_i \in \{0, 1\}.$$
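To make the formulation concrete, the following sketch evaluates a 0/1 assignment and finds the optimum by exhaustive enumeration. It is only an illustration of the objective and constraint above, not the branch-and-bound method used in the experiments; the function names are hypothetical and nonnegative weights are assumed.

```python
from itertools import product

def knapsack_value(p, w, M, x):
    """Profit of the 0/1 assignment x, or None if it exceeds capacity M."""
    if sum(wi * xi for wi, xi in zip(w, x)) > M:
        return None
    return sum(pi * xi for pi, xi in zip(p, x))

def brute_force_knapsack(p, w, M):
    """Try all 2^n assignments; only practical for very small n."""
    best_value, best_x = 0, tuple([0] * len(p))
    for x in product((0, 1), repeat=len(p)):
        value = knapsack_value(p, w, M, x)
        if value is not None and value > best_value:
            best_value, best_x = value, x
    return best_value, best_x
```

For example, with profits (10, 7, 4), weights (5, 4, 3), and M = 7, the best choice is objects 2 and 3 with total profit 11.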


Figure 5: (a) binary tree; (b) 3-ary tree


Horwitz and Sahni [7] describe two state space
trees that could be used to solve this problem. One
results from what they call the fixed tuple size formulation.
This is a binary tree such as the one shown in Figure
5(a) for the case n = 3. The other results from the
variable tuple size formulation. This is an n-ary tree.
When n = 3, the resulting tree is as in Figure 5(b). The
bounding function used is the same as the one
described in [7]. Since the bounding function requires
that objects be ordered such that
$p_i/w_i \ge p_{i+1}/w_{i+1}$, $1 \le i < n$, we generated our
test data by first generating random $w_i$s. The $p_i$s were
then computed from the $w_i$s by using a random nonincreasing
sequence $f_1, f_2, \ldots, f_n$ and the equation
$p_i = f_i w_i$. We generated 100 instances with n = 50 and
60 instances with n = 100. These 160 instances were solved
using the binary state space tree described above. (We also
tried the n-ary state space tree but found that it would take
several weeks of computer time to complete our simulation.
The reason it would take so much time is that when n-ary
state space trees are used a great number of nodes are
generated, and the queue of live nodes exceeds the
capacity of main memory and has to be moved to
secondary storage. In our program, it is time consuming
to maintain a queue of live nodes that must be
partly stored in secondary storage.)
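The data generation procedure described above might look as follows. This is a minimal sketch: the weight range, the range of the ratios $f_i$, and the function name are assumptions, since the text does not give them.

```python
import random

def generate_instance(n, max_weight=100, seed=None):
    """Return (p, w) with p[i]/w[i] nonincreasing in i.

    max_weight and the ratio range (0.5, 2.0) are illustrative
    assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    # Step 1: random weights.
    w = [rng.randint(1, max_weight) for _ in range(n)]
    # Step 2: a random nonincreasing sequence f_1 >= f_2 >= ... >= f_n.
    f = sorted((rng.uniform(0.5, 2.0) for _ in range(n)), reverse=True)
    # Step 3: p_i = f_i * w_i, so the objects are already ordered by p_i/w_i.
    p = [fi * wi for fi, wi in zip(f, w)]
    return p, w
```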


Table 1: Experimental results (knapsack)


Table 1 gives the average values for I(p), I(1)/I(p),
and I(p)/I(2p). From Table 1, we see that when n = 50,
I(1)/I(p) is significantly less than p for p > 8. The
observed improvement in performance is not as high as
one might expect. Similarly, the ratio I(p)/I(2p) drops
rapidly to 1 and is acceptable only for p = 1 and 2 (see
also Figure 6).


In none of the 100 instances tried for n = 50 did we
observe anomalous behavior. I.e., it was never the case
that I(p) < I(2p) or that I(p) > 2I(2p).

When n = 100, the ratio I(1)/I(p) is significantly less
than p for p > 8 (see also Figure 7). Of the 60 instances
run, 6 (or 10%) exhibited anomalous behavior. For all 6
of these there was at least one p for which I(p) > 2I(2p).
There was only one case where I(p) < I(2p). The values
of I(p), I(1)/I(p), and I(p)/I(2p) for these six instances are
given in Table 2. It is striking to note the instance for
which I(1)/I(2) = 14.6 and I(2)/I(4) = 0.15.
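The two anomaly conditions used above, I(p) < I(2p) and I(p) > 2I(2p), can be checked mechanically from measured iteration counts. The sketch below is one way to do so; the function name and the labels are hypothetical.

```python
def find_anomalies(iterations):
    """iterations maps a processor count p to the iteration count I(p).

    Returns (p, kind) for every p where I(2p) is also known and doubling
    the processors either increases the iteration count ("slowdown") or
    reduces it by more than a factor of two ("superlinear").
    """
    anomalies = []
    for p in sorted(iterations):
        if 2 * p not in iterations:
            continue
        i_p, i_2p = iterations[p], iterations[2 * p]
        if i_p < i_2p:
            anomalies.append((p, "slowdown"))
        elif i_p > 2 * i_2p:
            anomalies.append((p, "superlinear"))
    return anomalies
```

With the made-up counts `{1: 100, 2: 40, 4: 50, 8: 30}` (not data from Table 2), the function reports a superlinear step at p = 1 and a slowdown at p = 2.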


The Traveling Salesperson Problem:


Here we are given an n vertex undirected complete
graph. Each edge is assigned a weight. A tour is a cycle
that includes every vertex (i.e., it is a Hamiltonian
cycle). The cost of a tour is the sum of the weights of
the edges on the tour. We wish to find a tour of
minimum cost.

Figure 6: Knapsack with 50 objects (horizontal axis: p, number of processors, 1 to 512)

Figure 7: Knapsack with 100 objects (horizontal axis: p, number of processors, 1 to 512)

Table 2: Data exhibiting anomalous behavior


The branch-and-bound strategy that we used is a
simplified version of the one proposed by Held and Karp
[6]. Vertex 1 is chosen as the start vertex. There are
n - 1 possibilities for the next vertex and n - 2 for the
preceding vertex (assume n > 2). This leads to
(n - 1)(n - 2) sequences of 3 vertices each. Half of these may be
discarded as they are symmetric to other sequences.
Any sequence with an edge having infinite weight may
also be discarded. Paths are expanded one vertex at a
time using the set of vertices adjacent to the end of the
path. A lower bound for the path $(i_1, i_2, \ldots, i_k)$ is
obtained by computing the cost of the minimum spanning
tree for $\{1, 2, \ldots, n\} - \{i_1, i_2, \ldots, i_k\}$ and adding an
edge from each of $i_1$ and $i_k$ to this spanning tree in
such a way that these edges connect to the two nearest
vertices in the spanning tree.
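A minimal sketch of this bound is given below. It assumes a symmetric weight matrix `w` with `math.inf` for absent edges and 0-indexed vertices, uses Prim's algorithm for the spanning tree, and reads "nearest" as the cheapest edge from each path end to any remaining vertex. It covers only the spanning tree and the two connecting edges; how this is combined with the weight of the path itself is not stated here. Function names are hypothetical.

```python
import math

def mst_cost(w, vertices):
    """Cost of a minimum spanning tree over `vertices` (Prim's algorithm).
    Returns math.inf if the vertices are not connected by finite edges."""
    vertices = list(vertices)
    if not vertices:
        return 0
    # dist[v] = cheapest edge from v to the tree built so far.
    dist = {v: w[vertices[0]][v] for v in vertices[1:]}
    total = 0
    while dist:
        v = min(dist, key=dist.get)
        total += dist.pop(v)
        for u in dist:
            dist[u] = min(dist[u], w[v][u])
    return total

def completion_bound(w, path):
    """Spanning tree over the vertices not on `path`, plus the cheapest
    edge from each end of the path into that set."""
    rest = [v for v in range(len(w)) if v not in path]
    if not rest:
        # Every vertex is on the path: only the closing edge remains.
        return w[path[-1]][path[0]]
    first, last = path[0], path[-1]
    return (mst_cost(w, rest)
            + min(w[first][v] for v in rest)
            + min(w[last][v] for v in rest))
```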


In our experiment with the traveling salesperson
problem we generated 45 instances each having 20 vertices.
The weights were assigned randomly. However,
each edge had a finite weight with probability 0.35. Use
of a much higher probability results in instances that
take years of computer time to solve by the
branch-and-bound method.


Those 45 instances were solved using $p = 2^k$ processors,
$0 \le k \le 9$. The average values of I(p), I(1)/I(p),
and I(p)/I(2p) are tabulated in Table 3. As can be seen,
for p < 32 the average value of I(1)/I(p) is quite close to
p and the average value of I(p)/I(2p) is quite close to 2
(see also Figure 8). No anomalies were observed for
any of these 45 instances.


Table 3: Experimental results (traveling salesperson) 


4. Conclusions 


We have demonstrated the existence of anomalous
behavior in parallel branch-and-bound. Our experimental
results indicate that such anomalous behavior will be
rarely witnessed in practice. Furthermore, there is little
advantage to expanding more than k nodes in parallel.
k will in general depend on both the problem and
the problem size being solved. If we require I(p)/I(2p)
to be at least 1.66, then for the knapsack problem with
n = 50, k is between 4 and 8 whereas with n = 100 it is
between 8 and 16 (based on our experimental results).
For the traveling salesperson problem with 20 vertices k
is between 8 and 16. If p is larger than k, then more
effective use of the processors is made when they are
divided into k groups each of size approximately p/k.
Each group of processors is used to expand a single
E-node in parallel. If s is the speedup obtained by expanding
an E-node using q processors, then allocating q processors
to each E-node and expanding only p/q E-nodes
in parallel is preferable to expanding p E-nodes in parallel
provided that s I(1)/I(p/q) > I(1)/I(p).
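The final criterion is easy to evaluate once iteration counts are available. The sketch below assumes that q divides p and that I(1), I(p/q), and I(p) have all been measured; the function name and the sample numbers are hypothetical.

```python
def prefer_groups(s, iters, p, q):
    """True if giving q processors to each E-node (p/q E-nodes at a time)
    beats expanding p E-nodes at a time, i.e. if
    s * I(1)/I(p/q) > I(1)/I(p).

    s     : speedup of a single E-node expansion on q processors
    iters : dict mapping a number of E-nodes expanded in parallel to I(.)
    """
    return s * iters[1] / iters[p // q] > iters[1] / iters[p]

# Illustrative (made-up) iteration counts.
iters = {1: 1000, 8: 150, 16: 140}
print(prefer_groups(1.8, iters, p=16, q=2))  # grouping wins in this case
print(prefer_groups(1.0, iters, p=16, q=2))  # no intra-node speedup: it does not
```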
Figure 8: Traveling salesperson

Experience with Two Parallel Programs Solving the Traveling Salesman Problem

Abstract: The traveling salesman problem is solved on Cm*, a multiprocessor
system, using two parallel search programs based on the branch and bound algorithm
of Little, Murty, Sweeny and Karel. One of these programs is synchronous and has a
master-slave process structure, while the other is asynchronous and has an egalitarian
structure. The absolute execution times and the speedups of the two programs differ
significantly. Their execution times differ because of the difference in their process
structure. Their speedups differ because they require different amounts of
computation to solve the same problem. This difference in the amount of
computation is explained by their different heuristic granularities. The difference
between the speedup of the asynchronous second program and linear speedup is
attributed to processors idling owing to resource contention.


1. Introduction 

The traveling salesman problem (TSP, for short) is to find the round-trip tour that
visits each of N cities once, for the minimum cost, given an N x N matrix of the costs
of traveling from one city to another. There are several variations to the problem. I
chose to find the exact solution to the asymmetric non-euclidean TSP because it is the
most general of all variations.


Many algorithms based on the techniques of dynamic programming and branch
and bound solve the TSP. The programs studied here are based on the branch and
bound algorithm of Little, Murty, Sweeny and Karel [3]. This algorithm will be
called the LMSK algorithm hereafter. The two programs that were implemented lie on
significantly different points in the implementation spectrum so that a comparative
analysis of their performance will shed some light on the relation between
performance and parallel program attributes. An earlier technical report [4] contains
more details on the algorithm and the implementations.


The experiments were conducted on the multiprocessor system Cm* [6], running
the operating system StarOS [ij]. Cm* hardware consists of fifty Digital Equipment
LSI-11's. These processor-memory pairs, called Cm's, are connected by a
hierarchical, distributed switching structure. Cm* is partitioned into five clusters of
up to 14 Cm's each; the Cm's in an individual cluster are connected via a map bus to a
mapping processor, called a Kmap, through which they communicate with each
other. The clusters themselves are connected via intercluster busses. Any Cm can
reference memory anywhere in the system. However the cluster structure induces an
access hierarchy: typically, access to the memory of another Cm in the same cluster
costs about three times the access to the Cm's own (local) memory and access to
remote clusters costs about twelve times PP}. StarOS is an object-oriented,
message-based operating system to support collections of processes that cooperate to solve
problems.


2. LMSK Algorithm

The LMSK algorithm works by partitioning the set of all possible tours into
progressively smaller subsets, which are represented on the nodes of a state-space
tree, and then expanding the state-space tree incrementally toward the goal node
using heuristics to guide the search. The algorithm uses two heuristics to guide its
search toward the solution node. The node-selection heuristic chooses, from among
all the leaf nodes of the current tree, that leaf node whose estimated lower-bound
tour cost is the least. The edge-selection heuristic computes the increments in tour
cost when different edges are excluded from the tour, and chooses the edge that
causes the maximum increment. The description of the algorithm below explains
how the nodes and edges thus chosen are used.


Starting with a tree consisting of just the root node, which represents the set of all
tours, the algorithm repeatedly executes these steps in sequence: it first chooses, from
all the leaf nodes of the current tree, the node that is most likely to lead to the
solution (using the node-selection heuristic) and designates it to be the next
expansion node; it then chooses one or more edges (legs of a tour) using the
edge-selection heuristic; it sets up the child nodes with all the different combinations
of inclusion and exclusion of the selected edges, and finally it computes for each
child node the lower-bound cost of all tours in the subset defined by the child node.
The algorithm terminates when a leaf node is found that represents a single complete
tour with a cost less than or equal to the lower-bound costs of all possible tours,
which are represented by the leaf nodes of the current state-space tree.


Here is a high level representation of the algorithm:

repeat
    1: Select node in state-space tree using node-selection heuristic;
    2: Select one or more edges using edge-selection heuristic;
    3: For each child corresponding to one inclusion/exclusion
       combination of the selected edges,
        3.1: Create a node and link it to tree;
        3.2: Derive cost matrix for child node;
        3.3: Reduce matrix and find new lower bound for all tours
             defined by child node;
until a full tour, with a cost less than or equal to the lower bounds on all
possible tours, is obtained.
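Step 3.3 and the edge-selection heuristic can be sketched as follows. The text does not spell out the reduction, so this sketch assumes the usual row-and-column reduction (subtract each row minimum, then each column minimum; the total subtracted is a lower bound on any tour) and assumes that the increment for excluding a zero-cost edge is the sum of the smallest other entries in its row and column. Function names are hypothetical, and forbidden entries are represented by `math.inf`.

```python
import math

INF = math.inf

def reduce_matrix(cost):
    """Return (reduced copy of cost, total amount subtracted)."""
    reduced = [row[:] for row in cost]
    n = len(reduced)
    total = 0
    for i in range(n):                      # row reduction
        m = min(reduced[i])
        if 0 < m < INF:
            total += m
            reduced[i] = [x - m for x in reduced[i]]
    for j in range(n):                      # column reduction
        m = min(reduced[i][j] for i in range(n))
        if 0 < m < INF:
            total += m
            for i in range(n):
                reduced[i][j] -= m
    return reduced, total

def select_edge(reduced):
    """Among zero entries, pick the edge whose exclusion would raise the
    bound the most. Returns ((i, j), increment)."""
    n = len(reduced)
    best, best_inc = None, -1
    for i in range(n):
        for j in range(n):
            if reduced[i][j] != 0:
                continue
            row_min = min((reduced[i][c] for c in range(n) if c != j), default=INF)
            col_min = min((reduced[r][j] for r in range(n) if r != i), default=INF)
            if row_min + col_min > best_inc:
                best, best_inc = (i, j), row_min + col_min
    return best, best_inc
```

For the 3-city matrix `[[INF, 10, 15], [5, INF, 9], [6, 13, INF]]` the reduction subtracts 25 in total, which here equals the cost of the best tour (10 + 9 + 6).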


3. Implementations

One technique for adapting an algorithm for parallel execution is unfolding a loop
and letting multiple processes work on different iterations of the unfolded loop. This
technique is adopted here in two different ways to get two parallel programs, which
unfold different loops of the algorithm. One program, TSP1, unfolds the for loop that
sets up child nodes (step 3 of algorithm in previous section), while the other, TSP2,
unfolds the outermost (repeat) loop.


The first program, TSP1, is a synchronous master-slave program. The master process
implements the outer loop of the algorithm, steps 1 and 2, and the loop control of
step 3. The slave processes (one for each child) implement the steps within the inner
loop, that is, steps 3.1, 3.2, and 3.3. To obtain a parallelism of N during an execution,
$\log_2 N$ edges are selected by the master during each iteration, causing N child nodes
to be set up by N slaves, one child node by each slave.
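The relation between the number of selected edges and the number of children can be seen by enumerating the inclusion/exclusion combinations directly. The sketch below is illustrative only; the function name and the representation of a child as a tuple of (edge, included) pairs are assumptions.

```python
from itertools import product

def child_constraints(edges):
    """All include/exclude combinations of the selected edges.
    Each child is a tuple of (edge, included) pairs, so selecting
    log2(N) edges yields N children."""
    return [tuple(zip(edges, choice))
            for choice in product((True, False), repeat=len(edges))]

# Two selected edges give four children, one per slave at parallelism 4.
for child in child_constraints([(0, 1), (2, 3)]):
    print(child)
```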


The other program, TSP2, is implemented as a collection of asynchronous
cooperating processes with no master-slave relationship among them; such a process
structure is referred to here as egalitarian. Each process, during each of its iterations,
selects one edge and sets up two child nodes: one node including the edge in the tour,
and the other excluding it. Each process executes all the steps in the algorithm and
does so repeatedly until it is determined by consensus that a tour has been found.


4. Speedup 

In this section the speedups of the two programs will be presented. Speedup is the
ratio of serial execution time to parallel execution time. It reflects the effective
parallelism achieved at a given nominal parallelism.


4.1. TSP1 

Figure 1 plots speedup against parallelism for TSP1. As explained before, the
degree of parallelism for this program can assume only values that are powers of 2.
However data points in all the figures in this section are shown interpolated to depict
trends of smoothened values. Speedup at a given parallelism is computed as

2 * solution time for parallelism of two / solution time for given parallelism.

(This slightly non-standard definition of speedup is adopted here for practical
reasons.) The solution time is taken to be the elapsed time for solving the problem,
excluding the time spent creating the slave processes. The execution times vary
depending on the distribution of processes among the clusters because of the
hierarchical memory access structure of Cm*. To factor out this phenomenon, only
the best times, among all the different placements for which the experiments were
performed, were used when calculating speedups.


Figure 1: Speedup versus parallelism for TSP1

The dashed line shows linear speedup (speedup equals parallelism) and the solid
line plots the actual speedup achieved for this program. The actual speedup is
reasonable between parallelisms of 2 and 4 (speedup at a parallelism of 4 is 2.8).
However it starts going down after a parallelism of about 6, and remains constant at a
speedup of 2.6 after a parallelism of about 8. This droop is explained by the greater
total amount of work done at a greater parallelism for this program and by the
saturation of system bottlenecks when a greater amount of simultaneous computation
occurs in the system. The dot-dashed line in the graph factors out the effect of the
greater amount of work done with greater parallelism. This curve would have been
the speedup curve of this program, if the total computation for solving the problem
did not increase with increasing parallelism. The primary work of the algorithm
consists of setting up and reducing the cost matrix corresponding to the nodes of the
state-space tree. Thus the number of nodes generated is a measure of the total
computation performed for solving the TSP. Assuming that total computation for
solution was directly proportional to the number of nodes generated,

speedup adjusted for nodes generated corresponding to parallelism N
    = 2 * solution time per unit computation with 2 processes /
          solution time per unit computation with N processes

    = 2 * (solution time with 2 processes / number of nodes generated
          with 2 processes) / (solution time with N processes /
          number of nodes generated with N processes)
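The two TSP1 speedup definitions above translate directly into code. The following is a small sketch with hypothetical function names; the numbers in the example are made up for illustration and are not measurements from the experiments.

```python
def speedup_tsp1(time_2, time_n):
    """Speedup relative to the two-process run, scaled by 2."""
    return 2 * time_2 / time_n

def adjusted_speedup_tsp1(time_2, nodes_2, time_n, nodes_n):
    """Speedup per unit of computation, taking nodes generated as the
    measure of total computation."""
    return 2 * (time_2 / nodes_2) / (time_n / nodes_n)

# Made-up example: the N-process run is twice as fast but generates
# twice as many nodes, so the adjusted speedup is twice the raw one.
print(speedup_tsp1(100, 50))                    # 4.0
print(adjusted_speedup_tsp1(100, 400, 50, 800)) # 8.0
```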


The curve corresponding to the adjusted speedup behaves much better; it is a lot
nearer to the linear speedup line, does not show any signs of peaking or saturation
and continues to increase reasonably till a parallelism of 16, which was the maximum
parallelism of the experiment. The difference between the linear speedup line and
this curve corresponds to loss of processor time owing to the synchronous control of
the program and to contention for system resources such as the Kmaps, various
system busses, SLocals, and the Object Manager.


4.2. TSP2 
Figure 2 plots speedup against parallelism for TSP2. Since for this program single
process execution is possible, actual speedup at a given parallelism is computed as

solution time for parallelism of one / solution time for given parallelism.

The speedup curve looks much better than that for TSP1. It is close to the linear
speedup line until a parallelism of 6 and does not show any signs of peaking or
saturating at higher parallelisms. It is almost identical to the speedup adjusted for
nodes generated in TSP1. This adjusted speedup was what the system was capable of
achieving, if the total computation for solving the problem had remained constant
with increasing parallelism. This close correspondence of the two curves suggests
that the amount of computation does not increase with parallelism here. This
inference is reinforced by the number of nodes generated by TSP2 remaining nearly
constant for all parallelisms.


Figure 2: Speedup versus parallelism for TSP2

The absolute execution times for TSP2 are lower than for TSP1 by about 25% at a
parallelism of 2 and by about 80% at a parallelism of 16. For a parallelism of 2, the
nodes generated and the nodes expanded for both programs are the same, suggesting
that the total amount of computation done by all processes for solving the problem is
about the same for both the programs. Consequently one has to look for some
explanation other than increased computation for the 25% decrease in absolute
execution time at this parallelism. The following are two possible explanations for
this behavior. First, in TSP1 (with its master-slave structure), when the master is busy
deciding on which node to expand next, the slaves idle (after creating a look-ahead
node for the next iteration), wasting processor time. Because of the egalitarian process
structure of TSP2, processes here do not idle waiting for more work. Secondly, in TSP1
(with its synchronous program control), some time is wasted because of a lack of
absolute work balance between the slave processes. On the other hand, in TSP2, with
its asynchronous control, absolute work balance between processes is not critical, and
therefore no time is lost because of imbalance.


Figure 3: Speedup (normalized for cluster level contention) versus parallelism for TSP2

Figure 3 factors out the effects of cluster level contention on speedup. Cluster
level contention here refers to contention for resources that are replicated for each
cluster in the run-time system; these resources include the Kmaps, the Map busses,
and the Object Manager. Cluster level contention is distinguished from system level
contention, that is, contention for system-unique resources of the run-time
environment and of the application program itself. Examples of system-unique
resources are the intercluster bus and user locks. Speedup for this figure is computed
using solution times with the same number of processes in each cluster to ensure that
cluster level contention is nearly the same for all data points on a given curve. The
curves hug the linear speedup line closely except near a parallelism of 16; this
suggests that below a parallelism of 12 there is hardly any system level contention and
most of the loss in execution time seen in the previous speedup figure is attributable
to cluster level contention. Near a parallelism of 16, system level contention affects
performance adversely to a more significant extent.


5. Work and Heuristic Granularity 

Work of a program, here, refers to the total amount of basic computation done by
all the processes of the program to solve the given problem cooperatively; work
excludes processor idling, resource contention, and overhead costs, such as the costs
of locking and communication. Though these excluded costs are expected to change
with the implementation strategy that is used to map a general algorithm into a
parallel program, work itself, as defined above, is usually assumed to remain
constant. In practice, however, the mapping decisions can affect the work of a
program. Earlier experiments by other researchers [5] showed that work can change
with the degree of parallelism of an execution instance of a program and with other
program attributes, such as control synchronism. The two implementations of the
LMSK algorithm illustrate this phenomenon further.

Figure 4: Work variation in TSP programs


As observed before, the better speedup performance of TSP2
is directly attributable to its work remaining constant with parallelism (compare
Figures 1 and 2). For the LMSK algorithm, the total number of nodes in the final
state-space tree is an indicator of work done by the programs. Figure 4 shows how
the work for the two programs varies with parallelism. For TSP1, work increases with
parallelism, while for TSP2 it remains nearly constant.


Where does this difference in work between the two programs arise from? The
amount of work done by a heuristic search algorithm, such as the LMSK algorithm, is
determined by the effectiveness of the heuristics in bounding the search. This
difference in work can be explained by the difference in their heuristic granularities.
For a heuristic search program, the average amount of computation by all its
processes between points of heuristic application is the heuristic granularity of that
program [5]. For the TSP programs, one kind of heuristic granularity can be
associated with the node-selection heuristic, and another with the edge-selection
heuristic. Figure 5 depicts how the two kinds of heuristic granularities of the TSP
programs vary with parallelism. Since the major amount of computation in an
iteration goes into setting up new nodes, granularity here is expressed in terms of the
amount of computation needed to set up one new node. The heuristic granularity
corresponding to the node-selection heuristic and that corresponding to the
edge-selection heuristic will be examined separately below.


Figure 5: Granularity variation in TSP programs (curves: node heuristic granularity, TSP1; edge heuristic granularity, TSP1; both heuristic granularities, TSP2; vertical axis: granularity in node set-ups; horizontal axis: parallelism)


First, let us consider the node-selection heuristic. Both programs apply this
heuristic once every iteration. In TSP1, the number of nodes set up by the master
process in one iteration equals the degree of parallelism and hence the heuristic
granularity equals the degree of parallelism. The heuristic granularity then increases
linearly with parallelism. In TSP2, however, for each iteration two nodes are set up by
each process, irrespective of parallelism. Therefore heuristic granularity for this
program remains constant at two nodes.

The edge-selection heuristic is applied once for each edge selected in both
programs. In TSP1, for a parallelism of N, during each iteration $\log_2 N$ edges are
selected by applying the heuristic once for each edge selected and N nodes are set up.
The heuristic granularity corresponding to this heuristic, then, is $N / \log_2 N$ when the
parallelism is N. In TSP2, irrespective of parallelism, each process applies the
edge-selection heuristic once to select an edge and sets up two nodes during each
iteration. The heuristic granularity for this program thus remains constant at two
with changing parallelism.
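The granularities derived in the two preceding paragraphs can be tabulated with a few lines of code. This is only a restatement of those formulas, in node set-ups per heuristic application; the function names are hypothetical, and N is assumed to be a power of 2 no smaller than 2.

```python
import math

def tsp1_granularity(n):
    """(node-selection, edge-selection) heuristic granularity of TSP1
    at parallelism n: n nodes per node selection, n / log2(n) nodes
    per edge selection."""
    return n, n / math.log2(n)

def tsp2_granularity(n):
    """Both granularities of TSP2 stay at two nodes for any parallelism."""
    return 2, 2

for n in (2, 4, 8, 16):
    print(n, tsp1_granularity(n), tsp2_granularity(n))
```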


A comparison of Figures 4 and 5 will reveal the close parallel between the
variation in the heuristic granularities of the two programs and the work done by
them. The heuristics in these programs serve to limit the extent of the search needed
to solve the problem. An increase in heuristic granularity decreases the effectiveness
of the heuristics in limiting the search and consequently leads to an increase in the
total amount of computation needed to solve the problem. The relation between
work of a heuristic search program and its heuristic granularity (or alternatively the
frequency of application of heuristics) should not come as a surprise. However, the
possibility of heuristic granularity changing with the degree of parallelism is peculiar
to parallel heuristic programs.


6. Conclusion 

In general when a programmer sets out to implement a parallel program based on
some algorithm, he can design many parallel programs with different attributes. 
These attributes will depend on the design decisions he makes when mapping the 
algorithm to a program and, in particular, on the parallelization strategy that he 
adopts. As this work demonstrates, some attributes of the resulting program will 
influence its execution time and speedup characteristics to a significant extent. A 
programmer has to analyze the effects of not only the obvious program attributes 
such as the control synchronism and the pattern of resource usage, but also the more 
subtle ones such as the granularity of program events and the amount of computation 
the program needs to solve the problem at different degrees of parallelism.
Abstract -- This paper describes DOT, a model of an
architecture implementation specifically designed for
direct and maximally parallel execution of FFP (formal
functional programming) language programs. The model
is represented using tasks and abstract data types in C
and is running on UNIX®.


DOT is a refinement and extension of the architec- 
ture proposed by Mago [1,2]. User programs consisting 
of FFP language symbols are placed in a linear array of 
cells (the leaves of a binary tree of processors), and seg- 
ments of this array that contain innermost FFP applica- 
tions execute system programs in order to perform the 
required reductions. The system programs are written 
in LPL, a low-level concurrent programming language 
used to implement FFP primitives on the architecture. 


INTRODUCTION 


This paper deals with a language-driven architec- 
ture. Over the last decade there has been increased 
interest in this area, and a number of machines 
oriented toward efficient support for high level 
languages have been designed. This approach has been 
used for support of sequential languages such as Basic 
[3], Lisp [4], and Pascal [5]. 


With the increasing potential of VLSI implementa- 
tion technologies, highly parallel architectures that hold 
great promise for increased performance have become 
feasible. Although a number of parallel computer archi- 
tectures have been proposed, few of these have been 
directly associated in the above sense with a general 
purpose programming language. This is due to the low- 
level sequential transformation of states and reliance on 
global memory embodied in most languages, which 
make it difficult to express a direct mapping between a 
language and its execution on a parallel architecture. 
Notable exceptions to this are data flow languages [6], 
and functional languages [7]. 


Data flow languages arose from the search for a 
general model of parallel computation, and it is there- 
fore not surprising that a variety of parallel architec- 
tures for their support have been proposed and some 
implemented [8]. Functional languages, on the other 
hand, arose from the search for an algebra of programs, 
as described by Backus [9,10]. In this case, sequential 
transformation of states and global memory were ban- 
ished because of their unsatisfactory properties with 
respect to program semantics, and as a serendipitous 
result, functional languages are promising candidates 
for parallel support. 


Among possible architectures for this purpose are 
those suggested by Keller et al. [11], Treleaven [12], and 
Mago [1,2]. This paper is based on the work of Mago,
which differs from other proposals in its use of fine
grain parallelism. This approach removes assumptions
of global memory and overall processor state from the 
language support level as well, and completely realizes 
the parallelism allowed by functional programs. 


This work was supported by the National Science Foundation under 
grant MCS80-04206, and by a grant from the Harris Corporation. 
We now present a programming system composed
of three logical levels. As shown in Figure 1, the top 
(user) level is that of FFP languages, and the middle 
(system support) level is that of LPL, the concurrent 
programming language used to define and implement 
arbitrary FFP operators on the architecture. DOT is 
both a design and an implementation model of the 
desired parallel architecture, and is the lowest level. 


Figure 1 -- Three System Levels


FFP -- User Level
LPL -- Operator Support
DOT -- LPL & FFP Support


FFP LANGUAGES 


FFP languages have been formally defined by 
Backus [13]. Informally, an FFP language program is a 
linear sequence of symbols, of which four types of sym- 
bol are specially distinguished for the purpose of provid- 
ing structure: opening and closing application-forming 
symbols for applications, and similarly balanced list- 
forming symbols. An application is composed of an 
operator and exactly one operand. Both operator and 
operand may be lists and may contain further (i.e.,
nested) applications. A non-trivial FFP program is an 
application, and execution proceeds by successively 
reducing innermost applications according to the 
semantics of their respective operators until there are 
no further applications. The ultimate result is a constant
(i.e., non-reducible) expression. This is called
reduction style execution, since the program source is 
rewritten in a succession of semantically equivalent 
forms until the final result is achieved. 


FFP reductions are completely local in nature and 
are tightly encapsulated with respect to the rest of the 
program. This fact allows immediate, completely parallel
and non-interfering execution of all innermost applications
(hereafter referred to as reducible applications,
or RAs), and it is this property of FFP languages that
makes them so attractive for multiprocessor support. 


Figure 2 shows an FFP program which calculates
the inner product of two vectors. The application symbol
in our representation is a parenthesis "(", and the
list-forming symbol is an angle bracket "<". Within DOT,
all program symbols have an associated FFP text nesting
level, which removes the need for the balancing symbols
")" and ">". Examples of this representation are
found in Figure 4 (at end).


FIGURE 2: Inner Product of <1 2 3> with <4 5 6>


The original FFP program is:
    (+ (<α *> (τ <<1 2 3> <4 5 6>>)))
τ (matrix transpose) is innermost
    (+ (<α *> <<1 4> <2 5> <3 6>>))
<α *> (apply-to-all multiply) is innermost
    (+ <(* <1 4>) (* <2 5>) (* <3 6>)>)
three multiplications are innermost
    (+ <4 10 18>)
+ (n-ary add) is innermost
    32
which is the answer
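
The reduction style shown in Figure 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the App class, the operator names "trans" and "alpha", and the nested-list encoding are hypothetical stand-ins for the FFP symbols, and the loop is sequential where DOT would reduce all innermost applications concurrently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    """An FFP application: one operator and exactly one operand."""
    op: object   # atom such as "+", or a list such as ["alpha", "*"]
    arg: object  # atom, list, or a nested App


def contains_app(x):
    """True if x is, or contains, an application that is still unreduced."""
    if isinstance(x, App):
        return True
    if isinstance(x, list):
        return any(contains_app(e) for e in x)
    return False


def apply_op(op, arg):
    """Semantics of the few operators used in Figure 2 (hypothetical names)."""
    if op == "+":                       # n-ary add
        return sum(arg)
    if op == "*":                       # n-ary multiply
        product = 1
        for value in arg:
            product *= value
        return product
    if op == "trans":                   # matrix transpose
        return [list(col) for col in zip(*arg)]
    if isinstance(op, list) and op[0] == "alpha":   # apply-to-all
        return [App(op[1], element) for element in arg]
    raise ValueError(f"unknown operator: {op!r}")


def step(x):
    """Reduce every innermost application in x exactly once."""
    if isinstance(x, App):
        if contains_app(x.op) or contains_app(x.arg):
            return App(step(x.op), step(x.arg))   # not innermost: descend
        return apply_op(x.op, x.arg)              # innermost: rewrite in place
    if isinstance(x, list):
        return [step(e) for e in x]
    return x


def reduce_fully(program):
    """Return the succession of semantically equivalent forms."""
    trace = [program]
    while contains_app(program):
        program = step(program)
        trace.append(program)
    return trace


inner_product = App("+", App(["alpha", "*"],
                             App("trans", [[1, 2, 3], [4, 5, 6]])))
for form in reduce_fully(inner_product):
    print(form)
```

Each pass of step corresponds to one line of Figure 2; the three multiplications are rewritten in the same pass, which is the point at which a parallel machine gains over a sequential one.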


Mapping FFP onto a Linear Array of Cells 

The advent of VLSI has encouraged the design of
content addressable storage and other types of "intelligent"
memory [14], and it is only natural to attempt to
envision an intelligent memory in which a program
might be loaded and then executed in place. The locality
properties of FFP languages indicate they would be good
candidates for such treatment, and this would avoid the
CPU bottleneck associated with von Neumann processors.
The usual idea of memory is a linear address
space, so as a first step we can imagine placing the symbols
of an FFP program into a linear array of cells
(hereafter called lcells), each of which comprises processing
power as well as memory.


Reduction of an FFP RA within a group of contiguous
lcells will produce a new FFP program segment in
place of the original RA. Each symbol of this new segment
is a function of the original contents of the lcells
comprising the RA, and the semantics of the operator of
the reduction. We therefore need a way of specifying the
actions to be taken within the lcells of an RA that will
result in the correct (operator-defined) transformation,
i.e., we need an lcell programming language, or LPL.
Once an RA has been detected, the lcells of its containing
segment will each execute the LPL program
appropriate to the operator of the reduction.


The particular FFP primitives chosen and their LPL
implementations will be the concern of a system
manager, possibly in communication with knowledgeable
users. The compiled LPL code modules, which may
be thought of as system software support routines, are
held in a library available to the IO subsystem, and are
broadcast to the lcells when they are required.


A Tree-Structured Architecture 


The existence of such FFP operators as apply-to-all
(shown in Figure 2) implies global communication within
an RA, and the topology chosen to support this communication
is that of a binary tree with dynamically
reconfigurable routing. Non-leaf nodes of this tree are
hereafter called tcells, and the leaves will be the lcells,
as already discussed.


Figure 3 -- DOT MODEL STRUCTURE 

Figure 3 shows the logical structure of DOT. In
addition to the usual parent-child links associated with
binary tree structures, we make use of lateral connections
at the lcell level in order to facilitate shifting the
FFP text about to accommodate its expansion and contraction.
While these connections are not strictly necessary,
they simplify the model and seem feasible within a
hardware implementation.


LPL -- LCELL PROGRAMMING LANGUAGE 


The aim of an LPL program is to define an FFP
primitive. This is done by specifying appropriate actions
for each lcell of an application. LPL is therefore
designed to manipulate local lcell registers, and possibly
invoke simple global operations (e.g., message passing)
with which LPL statements in other lcells of the
same RA may interact. In practice, various groups of
lcells within the RA are given the same instructions
(e.g., all elements of a sequence), so an LPL program
consists of code segments -- one for each such group.


The most interesting aspects of LPL include the
message interactions between the lcells of an RA
(send/receive/endfilter), and the way in which LPL contexts
may spawn copies of themselves (fork) in order to
create additional FFP text symbols within the lcell
array. With these capabilities, LPL programs can implement
powerful FFP operators such as matrix transpose,
and parallelism within the tree structure can be used
very effectively. For example, sorting is O(n), and max is
O(log n).


LPL Environment 


There are no local (stack-based) variables in LPL.
Instead, a fixed number of LPL environment variables
within local lcell registers may be referred to. Many
environment variables are set up by DOT before LPL
statements are allowed to execute. Among these are
the local FFP symbol (called symbol), its nesting level
within the FFP program (called aln for absolute level
number), its nesting level within the RA (called rln for
relative level number -- i.e., relative to the aln of the
application symbol) and the "directory," which is used
to specify the location of the FFP symbol within the RA.


The directory is composed of two parts: the first
part is a symbol_index, specifying the position of symbol
within the RA; the second is a 4-tuple that encodes the
symbol location based on up to four levels of hierarchical
nesting within the RA. Figure 6 shows the directory
during execution of an example FFP program. In addition
to its use by LPL statements, the directory 4-tuple
is also used by DOT to choose which segment of an LPL
program should be executed within an individual lcell.
This will be explained in conjunction with the LPL
destination statement.


When the lcells in an RA have all completed execution
of their LPL program segments, the reduction is
"stepped forward" to its result. This is done by DOT with
the aid of the environment variables nsymbol_cnt,
nsymbol, and naln. The "n" prefix stands for "next,"
and these variables are set up in each lcell of an RA by
the LPL program. If nsymbol_cnt is zero when the RA is
stepped forward, the containing lcell becomes empty
(i.e., there is no FFP symbol in the lcell). Otherwise
nsymbol is moved to symbol, and naln is moved to aln.
Thus, the LPL programmer is primarily concerned with
creating code which (for each lcell of the RA) will load
nsymbol and naln with the symbol and aln values which
should next appear within the lcells of the RA in order to
implement the required reduction.
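
The stepping-forward rule just described is small enough to write down directly. The following Python sketch is an illustration only: the LCell class and its default values are assumptions made for the example, and only the zero versus non-zero distinction of nsymbol_cnt stated in the text is modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LCell:
    """Hypothetical model of the registers involved in stepping forward."""
    symbol: Optional[str] = None    # None models an empty lcell
    aln: int = 0                    # absolute level number of symbol
    nsymbol_cnt: int = 1            # assumed default: the cell keeps one symbol
    nsymbol: Optional[str] = None   # "next" symbol, loaded by the LPL program
    naln: int = 0                   # "next" aln, loaded by the LPL program


def step_forward(cell):
    """Apply the rule from the text to one lcell of a completed RA."""
    if cell.nsymbol_cnt == 0:
        cell.symbol = None          # the lcell becomes empty
    else:
        cell.symbol = cell.nsymbol  # nsymbol is moved to symbol
        cell.aln = cell.naln        # naln is moved to aln


def step_forward_ra(cells):
    """Step every lcell of the RA forward and return the surviving text."""
    for cell in cells:
        step_forward(cell)
    return [(c.symbol, c.aln) for c in cells if c.symbol is not None]


# The RA (+ <4 10 18>) after n-ary add has run: only the application
# symbol's lcell holds a next symbol; all other lcells vanish.
ra = [
    LCell("(", 0, nsymbol_cnt=1, nsymbol="22", naln=0),
    LCell("+", 1, nsymbol_cnt=0),
    LCell("<", 1, nsymbol_cnt=0),
    LCell("4", 2, nsymbol_cnt=0),
    LCell("10", 2, nsymbol_cnt=0),
    LCell("8", 2, nsymbol_cnt=0),
]
print(step_forward_ra(ra))
```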


Having described the objective and environment of 
LPL programs, we now briefly consider their structure 
and the more interesting statements. 


LPL Statements 


program/endprogram. A program statement is the
first statement of an LPL program. Its form is
program x where "x" is the (numeric) identifier of the
FFP operator whose operation this LPL program is
intended to implement. The assembler creates a library
object file for subsequent use whose name is based on
this identifier. The end of an LPL program is signalled
with an endprogram statement.


destination/endsegment. The same sequence of
LPL statements is not executed in each lcell of an RA.
Instead, an LPL program defines code segments and
specifies their respective lcell destinations through the
use of the destination statement. The first segment of
an LPL program whose destination matches an lcell's
directory 4-tuple is the segment that the lcell will execute,
and all following segments are ignored. The form
of the destination statement is destination d1 d2 d3 d4
where "d1" through "d4" are either an integer, or an
integer followed by "*". A match, as referred to above,
occurs if each of the lcell 4-tuple directory entries is
either equal-to (no "*" used) or equal-to-or-greater-than
("*" used) the respective destination value. The end
of a program segment is signalled with an endsegment
statement.
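
The matching rule can be made concrete with a short Python sketch. The function names and the textual encoding of a destination are assumptions for illustration; the rule itself (equality without "*", greater-than-or-equal with "*", first matching segment wins) is the one stated above.

```python
def parse_destination(text):
    """Parse e.g. '1* 0 0 0' into [(1, True), (0, False), (0, False), (0, False)].

    Each pair is (value, at_least); at_least is True when "*" was used.
    """
    fields = text.split()
    if len(fields) != 4:
        raise ValueError("a destination has exactly four fields")
    return [(int(f.rstrip("*")), f.endswith("*")) for f in fields]


def matches(destination, directory):
    """True if every directory entry satisfies its destination field."""
    return all(
        (entry >= value) if at_least else (entry == value)
        for (value, at_least), entry in zip(destination, directory)
    )


def select_segment(segments, directory):
    """Return the first segment whose destination matches the directory."""
    for destination_text, segment in segments:
        if matches(parse_destination(destination_text), directory):
            return segment
    return None   # no segment applies to this lcell


# Destinations in the order used by the n-ary add program; the segment
# labels are hypothetical.
add_segments = [
    ("0 0 0 0", "application symbol"),
    ("1* 0 0 0", "operator or sequence symbol"),
    ("0* 0* 0* 0*", "number to be added"),
]
for directory in [(0, 0, 0, 0), (1, 0, 0, 0), (2, 0, 0, 0), (2, 1, 0, 0)]:
    print(directory, "->", select_segment(add_segments, directory))
```

Because the first match wins, the catch-all destination 0* 0* 0* 0* must come last; placed first it would capture every lcell of the RA.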


fork. This statement is the means by which additional
lcells are allocated to hold expanding FFP text.
The name "fork" is given this statement because each
lcell may be thought of as a single process that executes
a sequential LPL program segment. A fork spawns copies
of its program segment and its execution context to
create new processes in the requested number of consecutive
lcells. Execution continues after allocation and
loading of these lcells by DOT (during storage management).
The form of the statement is fork forksize where
"forksize" is the (non-negative) number of lcells desired.
The fork_id environment variable is set by DOT during
support for this operation. The "parent" of the fork
operation is always given fork_id = 1, while the children
are given fork_id = 2 through forksize in left-to-right
ordering. This fact can be used in subsequent LPL statements
to condition execution. Copying or moving
groups of FFP symbols into new locations within the FFP
text is done using fork followed by send/receive.
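
The numbering of forked contexts can be sketched as follows. This is a hedged illustration: the dictionary used as an execution context and the function name are hypothetical, and a forksize below 1 is rejected here only because the text does not say what such a request would mean.

```python
import copy


def fork(context, forksize):
    """Return forksize copies of an LPL execution context, left to right.

    The first copy stands for the parent (fork_id = 1); the remaining
    copies are the children (fork_id = 2 through forksize).
    """
    if forksize < 1:
        raise ValueError("this sketch only models forksize >= 1")
    copies = []
    for fork_id in range(1, forksize + 1):
        clone = copy.deepcopy(context)   # each lcell gets its own registers
        clone["fork_id"] = fork_id
        copies.append(clone)
    return copies


# A sequence symbol that must become three symbols forks with forksize 3;
# later statements can branch on fork_id to load a different nsymbol.
contexts = fork({"symbol": "<", "aln": 2, "registers": {"m_arg1": 0}}, 3)
print([c["fork_id"] for c in contexts])
```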


send/receive/endfilter. These statements are the
means by which global communication within an RA is
carried out. They are supported by DOT processes within
both lcells and tcells. Messages are sent and received
during globally sequenced activities called message
waves, and all the lcells of an RA have the option of
participating in any of them. A limited amount of processing
can take place within the tcells during transmission
of a message wave, and appropriate instructions to the
tcells concerning processing requirements are sent up
by the lcells to introduce each new message wave. (This
information is supplied in the send statement.) The LPL
messages within a message wave travel from the lcells
up through the tree structure above them until they
reach the lowest common ancestor of all lcells within
the RA (called the top of area or toa). At the top of
area, all messages that have come this far (messages
may be combined, or passed selectively on the way up)
turn around and are broadcast to all lcells in the RA.
Those lcells doing either a send or a receive for that
particular message wave then "see" all returning messages
on that wave. Send and receive statements have a
filter portion that describes the actions to be taken for
each incoming message, and a local DOT message process
invokes this filter for each arrival after first moving
the message into a reception area within the lcell. The
difference between send and receive is that a send
sends a message (then filters incoming messages,
including its own), while a receive merely filters incoming
messages. Their forms are as follows:


send mwave order combine-op key1 key2 arg_cnt
    filter-statements
endfilter


receive mwave
    filter-statements
endfilter


The mwave argument is the index of the message
wave desired. The order argument indicates the order in
which two messages of differing key values should be
sent up from a tcell where they meet. (Key1 is given
precedence over key2.) When two messages arriving at a
tcell both have the same key values, this indicates that
the respective messages should be combined according
to the combine-op argument. "Arg_cnt" is the number
of additional message arguments (in addition to the key
values) which are to be sent. The additional arguments
are taken from lcell registers with names m_arg1 ...
m_arg5. When messages are combined arithmetically, it
is the m_arg values which are actually combined.
r_key1, r_key2, r_arg1 ... r_arg5 are the lcell registers
into which a message is placed by DOT prior to executing
a filter.
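
The upward half of a message wave can be sketched in Python as a merge performed at every tcell. Several things here are assumptions made for the example rather than details taken from DOT: messages are (keys, args) tuples, the order argument is taken to mean ascending key order, the tree over the RA is assumed to be balanced, and whole message lists are merged where the hardware would presumably stream them.

```python
from operator import add


def merge_at_tcell(left, right, combine):
    """Merge two key-ordered message streams meeting at a tcell.

    Messages are ((key1, key2), args). Tuple comparison gives key1
    precedence over key2. Messages with equal keys are combined
    argument by argument using combine.
    """
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        (lkeys, largs), (rkeys, rargs) = left[i], right[j]
        if lkeys == rkeys:
            merged.append((lkeys, [combine(a, b) for a, b in zip(largs, rargs)]))
            i, j = i + 1, j + 1
        elif lkeys < rkeys:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]


def message_wave(lcell_messages, combine=add):
    """Return the messages that reach the top of area (toa).

    lcell_messages has one entry per lcell of the RA: a message, or
    None for an lcell that sends nothing on this wave. The result is
    what would be broadcast back down to every lcell of the RA.
    """
    if not lcell_messages:
        return []
    if len(lcell_messages) == 1:
        message = lcell_messages[0]
        return [] if message is None else [message]
    middle = len(lcell_messages) // 2
    return merge_at_tcell(message_wave(lcell_messages[:middle], combine),
                          message_wave(lcell_messages[middle:], combine),
                          combine)


# n-ary add: every number sends itself with identical keys, so the
# messages collapse into a single sum on the way up.
print(message_wave([None, None, None,
                    ((0, 0), [4]), ((0, 0), [10]), ((0, 0), [18])]))

# Distinct keys are not combined; they arrive at the toa in key order.
print(message_wave([((3, 0), ["c"]), ((1, 0), ["a"]), ((2, 0), ["b"])]))
```

With identical keys and a combining operator such as max, the answer is formed in a number of tcell steps proportional to the height of the area, while distinct keys yield a sorted stream whose length is the number of senders.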


LPL Program Example 

This section presents the LPL program for n-ary
add which supports the FFP program whose execution
trace is given later.


program 004                 /* FFP N-ARY ADD OPERATOR */

destination 0 0 0 0         /* The application symbol */
    keep
    receive 1               /* receives the result. */
        mov r_arg1 nsymbol
    endfilter
endsegment

destination 1* 0 0 0        /* Operator, seq symbols */
endsegment                  /* go away. */

destination 0* 0* 0* 0*     /* Numbers to be added */
    mov symbol m_arg1       /* send themselves */
    send 1 + + #0 #0 1      /* using addition */
    endfilter
endsegment

endprogram

DOT -- SUPPORT FOR LPL and FFP 


We have described the functions and controlled 
environment made available to LPL by DOT. It is now 
necessary to examine how these are provided while also 
supporting the innermost reduction semantics of FFP 
languages. 


A DOT machine cycle starts with looking at the lcell
array to see what is in it. In the course of this operation,
RAs are discovered, and the machine is partitioned so as
to correctly allocate communication channels and tcell
processing power to the discovered RAs. The first time a
particular RA is encountered (RAs may exist over a
period of many machine cycles), DOT processes within
the tcells and lcells build the lcell environment directory,
and the LPL program is loaded using the IO subsystem.
After these operations, all of which take place in
the partitioning phase of the machine cycle, the LPL
programs in RAs are started (or restarted). This happens
separately within each RA, so RAs that do not
require a new directory and LPL code can be restarted
earlier.


At this point, the notion of a single machine is 
misleading; each RA has its own dedicated hardware and 
is completely independent of the others. Nevertheless, 
after the RAs are started (or restarted), the machine 
may be thought of as being in an execution phase. The 
LPL programs run, with the aid of DOT-provided ser- 
vices, until they become blocked or are preempted by 
DOT for the purpose of storage management. 


The storage management phase includes stepping
forward any RAs that have finished, determining the new
storage requirements of the FFP programs within the
lcell array (due to LPL fork statements that have been
executed), and shifting LPL program segments and their
contexts within the lcell array to make room for newly
required symbols. The shifting process is performed
using the lateral lcell connections (shown in Figure 3)
and may result in the overflow of contexts into the
overflow subsystem if enough lcells are not available,
reentry of previously overflowed lcell contexts back into
the physical lcell array if there is room, or entry of new
FFP programs if there is room after previous overflow
has been taken care of. The prescription for exactly how
the lcell contents are to be shifted about is called the
specification for storage management.


The basic machine cycle is thus partitioning, execution,
and storage management. These phases will now
be described in more detail.


Partitioning Phase 

Partitioning creates active areas, each of which is a
binary tree dynamically imbedded within the overall
tree-structured architecture. Each active area is composed
of the dedicated communication channels, and
the lcell and tcell hardware required for supporting
computation in an individual RA. Partitioning begins in
the lcells and is continued in the tcells. Each tcell
receives (from its children) and sends (to its parent) a
code containing the information necessary for an initial
partitioning of the tcells.


The initial partitioning allocates and connects dedicated
area communication channels (called area channels)
and dedicated tcell processing power (called area
nodes) to each underlying group of lcells that may contain
a different RA. While the area channel connections
are changed with each partitioning, the information
required for the initial partitioning travels upwards on
cell manager channels whose connections never change.
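
For a contiguous group of lcells, the tcell that ends up as the top of area is the lowest common ancestor of the group's first and last lcell. The Python sketch below locates it arithmetically. It is not the distributed upsweep used by DOT; it assumes a complete binary tree with heap numbering (root 1, children 2i and 2i+1), which is a convention chosen for the example.

```python
def top_of_area(first_lcell, last_lcell, height):
    """Return the heap index of the lowest common ancestor of a leaf range.

    Leaves are numbered 0 .. 2**height - 1 from left to right and sit
    at heap indices 2**height + k.
    """
    leaf_count = 1 << height
    if not 0 <= first_lcell <= last_lcell < leaf_count:
        raise ValueError("lcell range is outside the machine")
    a = leaf_count + first_lcell
    b = leaf_count + last_lcell
    while a != b:          # climb one level at a time from both ends
        a //= 2
        b //= 2
    return a


# An 8-lcell machine (height 3): two adjacent areas and their toa nodes.
print(top_of_area(0, 3, 3))   # whole left half
print(top_of_area(4, 5, 3))   # two lcells under one tcell
print(top_of_area(3, 4, 3))   # an area straddling the middle
```

An area that straddles the middle of the lcell array has the root as its top, even if it holds only two symbols, so message latency depends on where an RA happens to lie as well as on its size.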


The initial partitioning is terminated within the IO
subsystem, which may be thought of as the parent of
the root of the tree. (In addition to its IO-related activities,
the IO subsystem offloads special termination processing
from the tree root.) RAs are finally located and
their corresponding active areas created with the aid of
concurrent downsweeps within each of the candidate
areas created by the initial partitioning. Figure 4 (at
end) shows the area channels and nodes for a partitioned
DOT processor.


Area Nodes. As suggested by Figure 4, a tcell need
only provide processing power for one active area. This
is because even though area channels for more than one
active area may pass through a given tcell, it is always
possible to directly route through the tcell (via what
may be considered a straight wire connection) all but
one set of area channels, which, if it exists, is composed
of the area channels leading to two children and possibly
a parent, all in the same area. During partitioning,
such a set of area channels is connected to a tcell area
node whose primary purpose is to support all subsequent
area-related processing within the tcell.


This support begins with completion of the partitioning
phase (i.e., discovery of active status, discovery
of the FFP operator if the area is active, pruning of
channels that are for one reason or another not
required for area processing, creation of the toa where
messages will turn around, and directory creation).
Partitioning is then followed in an active area by support
for the LPL send and receive operations. This is then
followed by correctly shutting down operation prior to the
storage management phase of the machine cycle. This
shutdown must disconnect area channels (which were
created during partitioning), but only after stopping
messages in such a way as to guarantee that all lcells in
the area will have seen exactly the same messages during
the execution phase. This must be done in order to
guarantee a consistent restart following storage
management and re-partitioning.


Directory Creation. Following the initial partitioning
upsweep, each toa returns to its descendent lcells
notification of their active status and the LPL program
to be used if one is necessary. Given this information, an
lcell will decide to create a directory if it is contained in
a new RA. This requires an upsweep and a downsweep
within the area channels, and the result is to load
symbol_index, and the 4-tuple directory, d1 ... d4, with
the correct values. If the RA is not new, the old directory
is still valid and execution may begin immediately
without this step.


Loading LPL Programs. The LPL programs are
delivered from the IO subsystem on io channels that follow
the hardware tree structure. Within a tcell, each
parent io channel splits into two child io channels and
data movement is as follows: input to the lcell array
comes from "above" and is broadcast to all lcells by
successively splitting data so what comes in from a parent
input channel is sent down both child input channels;
output comes from "below", and is sequenced by handling
the child output channels in cyclic left-to-right
order. There are two very simple processes in the tcell
that perform these functions. At present, the input
channels are used to deliver LPL programs from the
library, and output channels are used to return execution
results and trace information to the outside world.


Execution Phase 

The lcell LPL interpreter is a process that receives
starting addresses from a queue. It begins execution at
the requested address, performs local data movement
and manipulation as indicated by the loaded LPL object
code, and continues until encountering one of the following
DOT service requests which require special handling:
send, receive, endfilter, fork, and endsegment.


These special services are initiated by setting up an 
LPL context area associated with the particular service 
required. These areas are checked by the DOT 
processes whose job it is to provide the services. Having
set up the service request area, the interpreter then 
cycles back for another start address. The reason for 
this approach will be seen in the following discussion of 
message support. 


Lcell Message Support. Whenever a message arrives
which should be filtered (due to a send or receive for
the present message wave), the DOT lcell message input
process first puts the newly received message into a
receive area (accessible to filter statements using
named variables such as r_arg1), and then uses information
deposited earlier (by the interpreter) in the LPL
program context to insert the beginning address of the
message filter statements into the interpreter start
address queue. The interpreter executes the message
filter for the message instance, and then encounters the
endfilter statement, which then halts the interpreter as
described above. This is done for each message that
arrives on the present message wave. When the wave has
completed, the lcell message input process places the
continue address (i.e., the address of the first statement
following the endfilter statement) into the interpreter
start address queue, and LPL execution then continues.


Message waves are sequenced activities whose completion
requires agreement among all of the lcells of an
RA. The basis of this agreement is an end-of-wave message
or eow that is sent for each message wave by all
lcells of an RA, merged into a single message by the
time it reaches the toa, and then returned to the lcells
in the RA. Lcells keep a counter which contains the
present message wave number.


Whenever the message wave counter is incremented
in an executing lcell (due to receiving eow), the
LPL program context is checked for a send or receive
request for the new wave number. If there is such a
request, an eow is sent (after message transmission if
the request was send). If the request is for a send or
receive on a higher numbered wave, eow is also sent. If,
however, the last message request is for a lower numbered
message wave, a fork has been executed. In this
case, the eow is not sent. Instead, we wait for storage
management to complete the fork operation. The result
is that the new message wave cannot pass through the
toa until after storage management (and completion of
the fork operation).


Following storage management, everything is res- 
tarted correctly so a message wave interrupted by 
storage management may complete, and the next one 
begin (all transparent to the LPL program). This allows 
implicit synchronization of a fork operation with a 
corresponding send designed to copy information. 


Fork Support. A fork statement halts execution
within the requesting lcell until the operation can complete
during storage management (when LPL program
contexts are shifted in the lcell array). Execution then
resumes in the child lcells as well as the parent. LPL
program contexts begin each execution phase acting as
if they had requested a forksize of 1. The fork statement
merely modifies the lcell variable (not directly
available by name to an LPL program) in which this
value is stored so that multiple copies of an LPL program
context are shifted rather than just one during
the next storage management.


Storage Management Phase 

This phase is necessary to accommodate growth
and compaction of the FFP text while retaining the
necessary ordering of FFP symbols. It is unfortunate
that execution of LPL programs should in general
require interruption in order to implement this phase
of the machine cycle. One alternative is to let all LPL
programs complete (or become blocked as in a fork
operation) before storage management is performed,
but this could put RAs with quickly executing LPL
programs at a disadvantage, and would likely result in
inferior utilization of the available processing power in a
large machine.


Attempts have been made to do storage management
in locally restricted segments of the lcell array (as
computation proceeds elsewhere) by Tolle [15], but the
complexity of the overall solution is considerable, and
the resulting performance is not always superior to the
preemptive approach that we use.


In the present design, lcells send permission to
start storage management upwards on the cell manager
channels to the IO subsystem. Lcells which are not
active do this following partitioning. Active lcells wait for
the LPL program to complete, or fork, before sending
permission. These sm_grant messages are merged on
their way up the tree, and upon reaching the IO subsystem,
result in generation of a stop message which then
travels down the tree and shuts down message activity.
This approach is attractive, since it places control of the
processing cycle explicitly within LPL, and allows a system
manager to tailor FFP operators for large operands
if this is desired. Another possibility would be to allow
the IO subsystem to use heuristics based on lcell contents
(discovered during partitioning) to determine an
appropriate cycle time.

The Specification for Storage Management. Once 
the LPL programs are shut down, a specification for 
storage management must be computed. This is done by 
sending and merging forksize information up the cell 
manager channels until it arrives at the IO subsystem, 
where, as in partitioning, the upsweep is terminated. 
There, a specification for storage management is computed,
and sent back down the tree in such a way as to
distribute the necessary information to each lcell. A
variation on the scheme suggested by Mago [1] is used, 
so that total compaction will be performed only when 
necessary. 
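
To make the idea of a specification concrete, here is a deliberately simple Python sketch. It is not the scheme of Mago [1] nor the variation used in DOT: it always performs total compaction toward the left, which the text says is done only when necessary, and it merely counts overflow without modelling the overflow subsystem. The function name and the list encoding (one forksize per physical lcell, 0 for an empty lcell) are assumptions.

```python
from itertools import accumulate


def storage_spec(forksizes):
    """Compute target lcell positions for the next cycle.

    forksizes[i] is the number of lcells that the context now in
    lcell i needs (0 for an empty lcell, 1 if no fork was requested).
    Returns (targets, overflow): targets maps each occupied lcell to
    the range of positions its copies should occupy, preserving the
    left-to-right order of the FFP text; overflow is the number of
    contexts that do not fit in the array.
    """
    capacity = len(forksizes)
    needed = sum(forksizes)                 # the merged upsweep total
    # Exclusive prefix sum: cells needed strictly to the left of lcell i.
    starts = [0] + list(accumulate(forksizes))[:-1]
    targets = {
        i: range(start, start + size)
        for i, (start, size) in enumerate(zip(starts, forksizes))
        if size > 0
    }
    return targets, max(0, needed - capacity)


targets, overflow = storage_spec([1, 0, 3, 1, 0, 0])
for source, destination in targets.items():
    print(source, "->", list(destination))
print("overflow:", overflow)
```

The sum and the prefix sum are both computations that fit naturally on an upsweep followed by a downsweep of a tree, which is one reason a specification of this general kind suits the architecture.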


Overflow and Program Entry. The virtual memory
concept used in the model is based on the work of Siddall
[16] and Frank [17], who have examined various
ways to accommodate overflow from the lcell array. The
approach used is to allow storage management into and
out of the left lcell tree boundary. To the left of this
boundary is a deque structure (interfaced with a file
system), that receives from its right any lcell contents
that overflow from the tree, and from its left new programs
for execution in the tree. The state of the
overflow and program entry subsystem (e.g., if there is
presently overflow in virtual memory, if so how much,
how large the next FFP program to be entered is, etc.)
is used by the IO subsystem in its determination of the
actual storage management specification to be used.


REPRESENTING DOT 


In order to specify a detailed implementation for 
the architecture, we required a powerful representation 
language capable of expressing the parallel activities of 
the envisioned architecture with precision. We have 
access to a UNIX system, so we decided to use the C
language augmented with abstract data types (classes
and tasks [18]) in order to produce a comprehensive
and executable model. 


The DOT lcell and tcell classes are shown in Figure 5
(at end). The complete DOT model representation
includes 25 classes. When the model is executed, a tree
height parameter is given, and the required number of
these classes are instantiated and connected to form a
machine of the desired size. Operation then begins with
partitioning of the (empty) machine. DOT models for
machines containing hundreds of lcells may be created
and observed within computationally reasonable time
periods.


To illustrate operation of the model, Figure 6 (at
end) contains trace output from the example FFP program:

    (+ (<apply-to-all *> <<1 3> <2 4>>))

Column headings are for the user program id, lcell
symbol, lcell state (0=ground, 1=executing, 2=completed),
fork_id, aln, rln, symbol_index, and the directory
4-tuple. Columns to the right of the arrow are used
to indicate the result of stepping a completed reduction
forward. Empty lcells do not appear, and lcells with
symbols in them are listed in left-to-right order. Output
is at the end of each cycle.


CONCLUSIONS 


This paper has described a complete and opera- 
tional model of a multiprocessor architecture designed 
for maximally parallel execution of FFP programs. The 
word "complete" is relative to (among other things) the 
level of detail chosen, and we have deliberately stopped 
before the hardware level. Nevertheless, all necessary 
algorithms (sequential and concurrent) have been con- 
cretely represented within DOT, as have all external 
interfaces, and many other design details. LPL has been 
defined, and an assembler written. The resulting model 
is as close to a true machine (at levels above the 
hardware) as is possible. Major improvements (with 
respect to both simplicity and performance) on the ear- 
lier design by Mago have been realized. These further 
increase the desirability of the architecture. 


Representing the model in a concurrent programming
language has allowed testing and verification of
DOT. The importance of verification is obvious when the
model is to be imbedded in hardware. Other benefits of
a complete software model include the possibility of
performance simulation, and estimation of hardware
complexity.

An analytic model of program execution time on 
such a tree machine has been proposed by Mago et al.
[19]. Using this as a guide, a similar model based on the 
DOT model has been developed. This work, and other 
details not included here will be found in a forthcoming 
dissertation [20]. 


If performance simulation supports the belief that
an architecture based on the model we have described
is a good idea, the next hurdle is mapping the DOT
design into hardware. DOT has been created with VLSI
technology in mind. Binary tree structures map well
onto VLSI, the processing cells are self-timed, and active
tasks in a model instance should correspond to
hardware areas specifically designed for their support.
The lcell interpreter is probably the most traditional
and straightforward of these, while the tcell IO relay
mechanisms and downward message handler will likely
be trivial due to their simplicity. All of the data and
message movement in the machine has been designed
with a bit serial pipelined approach in mind.

Figure 5a -- The DOT TCELL Class


Figure 5b -- The DOT LCELL Class 
Figure 6 -- TRACE OUTPUT 


NDX-DIR --> NSYM NALN 
000 0000 
000 0000 
000 0000 
001 1000 
002 1100 
003 1200 
004 2000 
005 2100 
006 2110 
007 2120 
008 2200 
009 2210 
010 2220 
NDX-DIR --> NSYM NALN 
000 0000 
000 0000 
000 0000 
001 1000 ( 
002 1100 
003 1200 
004 2000 
005 2100 
006 2110 
007 2120 
008 2200 ( 
008 2200 
008 2200 
009 2210 #002 004 
010 2220 #004 004 
NDX-DIR --> NSYM NALN 
000 0000 
000 0000 
000 0000 
000 0000 
001 1000 
002 2000 
003 2100 
004 2200 
000 0000 
001 1000 
002 2000 
003 2100 
004 2200 
NDX-DIR --> NSYM NALN 
000 0000 #011 000 
001 1000 
002 2000 
003 2160 
004 2200 
##### End of Cycle 1 ##### 
This application not innermost 

4 is the op_code for n-ary add 
This application is innermost 
so state = executing 

8 is the op_code for apply-to-all 

12 is the op_code for multiply 

This symbol forks and receives 
a copy of the operator (mult) 
as required by apply-to-all 

##### End of Cycle 2 ##### 
The reduction has completed 
and so is stepped forward. 

The result is a sequence of 
parallel multiplications. 

Note that the fork_id tells 
how "<" should step forward 
for these three symbols. 

##### End of Cycle 3 ##### 
Figure 4 gave a partitioning 
for exactly this FFP text, with 
its two parallel RAs and two 
supporting active areas. 

Both multiplications complete 
in one cycle, and are stepped 
forward. 

##### End of Cycle 4 ##### 
Add is now innermost, and it 
also completes in one cycle. 
11 is the answer. 


THE TREE MACHINE 
AN EVALUATION OF STRATEGIES FOR REDUCING PROGRAM LOADING TIME 
Abstract -- The Caltech Tree Machine has an ensemble 
architecture with processors interconnected as a binary 
tree. The program code has to be loaded into the Tree 
Machine through the root processor. By exploiting the 
regularity of the user-defined tree structure, the 
downloading time can be reduced from $O(N)$ to 
$O(\log_2 N)$, where N is the total number of nodes in the tree. 


This paper presents three algorithms for loading the node 
types of the tree. All algorithms have a best-case loading 
time of $O(\log_2 N)$. Two of them, which take a binary tree 
description as input, have worst-case performances of 
$O(\sqrt{N/f})$ and $O(N^{1/\log_2 f})$ for regular trees of 
fanout f. The third algorithm has a worst-case performance of 
$O(\log_2 N)$. 


Introduction 


The Caltech Tree Machine has an ensemble architecture 
[1], [2], with processors interconnected as a binary tree. 
Each node is a complete, although small, von Neumann 
machine that can execute its own program. Synchronization 
is achieved by message passing between adjacent nodes. 
Hence, it is significantly different from other tree 
machine projects such as [3], [4]. The Tree Machine is an 
attached machine. The host compiles and loads a program 
into the Tree Machine, loads data into it, and interacts 
with it during execution. The root is the only 
point of communication with the outside world [5]. 


To program the Tree Machine, a user has to be aware of 
the tree structure. It is necessary to devise a tree data 
structure and algorithm for the problem to be solved. The 
tree may have arbitrary fanout and arbitrary size. Mapping 
onto the binary tree configuration of the Tree 
Machine is performed by the software system. A fanout 
greater than two is achieved by introducing so-called 
padding nodes as descendants to a Tree Machine node until 
the specified fanout is realized by the descendants of the 
last level of padding nodes. Padding nodes are inserted 
in the left and right subtrees in alternating order, at 
every level, to minimize the unbalancing. Padding nodes 
as well as their code are generated by the software system. 


In many tree machine algorithms devised so far [1], [6], 
the user-defined logical tree is designed to have 
only two or three node types: one for the root, one for the 
leaf processors, and one for the intermediate processors 
of the tree. Only the distinct pieces of code, i.e., one copy 
of the code for each node type, need to be supplied to the 
root. Replication of program code is easily accomplished 
by having each processor copy the code it receives, if it is 
of its type, and always pass it on to its two descendants. 


Assigning node types in the Tree Machine 


All considered algorithms load types into tree nodes by 
recursively and concurrently expanding a description re- 
ceived by the root. A naive approach is to simply load the 
types of all nodes in a left to right, root to leaf order. The 
length of the input of node types is proportional to the 
size of the tree. This fully explicit description rapidly be- 
comes too costly. 


The size of the tree descriptions we propose is at best 
proportional to the number of different node types, which 
in many cases is very small and independent of the tree 
size. However, it should be recognized that the binary 
tree that results from mapping a regular, arbitrary-fanout 
tree onto a binary tree is in many cases much 
less regular than the original tree (Figure 5). 


Two different forms of the tree description supplied to 
the root of the Tree Machine are analyzed: 


-- a binary tree, i.e., the host performs the mapping. 


-- a logical tree, i.e., the mapping onto a binary tree is 
performed by the Tree Machine itself. 


The tree description supplied to the root is constructed 
by the host from the specification given in a Tree Machine 
program. In creating this description the host traverses, 
or scans, the tree in a predefined order. Nodes of 
identical type and fanout appearing consecutively in scan 
order only need to be represented once, together with a 
number specifying the number of occurrences. Subtrees 
can be defined to achieve an extremely compact input for 
regular trees having few node types. 


Two scan orders are studied, normal and bit-reversed. 
Both orders progress from root to leaves, level by level. 
Normal scan order implies that nodes within a level are in 
left to right order, Figure 1.a. In bit-reversed scan order, 
nodes within a level are visited in bit-reversed order, Fig- 
ure 1.b. This ordering implies that for any node its left 
and right subtrees are visited alternately. 
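
The two scan orders can be illustrated with a minimal Python sketch. The function names, and the convention that the nodes of a level are numbered 0 to 2^level - 1 from left to right, are assumptions made for this illustration and are not taken from the paper.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def scan_order(level, bit_reversed=False):
    """Left-to-right positions of the 2**level nodes of one level,
    listed in the order in which they are visited."""
    positions = range(2 ** level)
    if not bit_reversed:
        return list(positions)
    return [bit_reverse(p, level) for p in positions]


if __name__ == "__main__":
    print(scan_order(3))                     # normal order
    print(scan_order(3, bit_reversed=True))  # bit-reversed order
```

In the bit-reversed list, consecutive entries alternate between the left half and the right half of the level, which is the alternation between left and right subtrees described above.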


Figure 1. Normal and bit-reversed scan order 


The scan order used in generating the description affects 
both its length and the complexity of the loading algo- 
rithm. The bit-reversed ordering results in a simpler 
loading algorithm because a node only needs to set a flag 
to indicate that the next input should go to either the left 
or the right port. With normal scan order, a Tree Machine 
node needs to monitor when the left half of the level 
currently being treated has been exhausted, and type as- 
signment should progress to the right subtree, as well as 
when that is exhausted and the next level of the left sub- 
tree shall be assigned types. The decoding of the tree 
description is more complex in this case. 


I. Binary tree as input 

For this case the host is assumed to perform the mapping 
onto a binary tree, if required. The padding processors 
are inserted in bit-reversed order to keep the tree 
balanced. 


Identical consecutive nodes in the given scan order are 
represented by a pair, (NUM, ID). NUM is the number of 
consecutive occurrences of nodes of type ID. The same 
node type may appear several times in the description. 
Two special delimiters, '(' and ')', are introduced to allow 
subtrees to be grouped together in the input. Identical 
subtrees encountered consecutively in scan order only 
need to be specified once. 
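
As an illustration of the (NUM, ID) pairs, the following Python sketch run-length encodes a list of node types given in scan order. It is a simplification under stated assumptions: the subtree delimiters '(' and ')' are not handled, each NUM and each ID is counted as one word, and the function names are hypothetical.

```python
from itertools import groupby


def describe(node_types):
    """Collapse runs of identical node types into (NUM, ID) pairs."""
    return [(len(list(run)), node_type)
            for node_type, run in groupby(node_types)]


def description_length(pairs):
    """Length in words, counting one word for NUM and one for ID."""
    return 2 * len(pairs)


if __name__ == "__main__":
    # A 3-level binary tree: three nodes of type A, then four leaves of type B.
    pairs = describe(list("AAABBBB"))
    print(pairs, description_length(pairs))
```

For the 3-level tree of Figure 2 this yields the description 3A,4B with a length of 4 words.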


                                                  Length 
NO:  3A,4B                                        4 
BRO: 3A,4B                                        4 

Figure 2. A 3-level binary tree with two different node types 

                                                  Length 
NO:  1A,2(2P,3B)                                  9 
BRO: 1A,4P,4B,2b,2B                               10 

                                                  Length 
NO:  1A,2P,3(2P,3B),1(1P,2B)                      18 
BRO:                                              10 

                                                  Length 
NO:  1A,1P,3(1B,1P,3C)                            13 
BRO: 1A,1P,2(1B,1P,2C,1b,1C),1b,1(1B,1P,2C,1b,1C) 32 

Figure 5. A three-level, 3-ary tree with padding nodes 


The two different scan orders, in the following referred to 
as NO and BRO, are analyzed on three simple cases: 


Case 1. A binary tree with identical nodes at each level: 


Different levels may have different node types. The two 
scan orders render the same description. Its length is at 
best twice the number of different types, at worst twice 
the tree height, since all nodes on a level of the tree are 
assumed to be of the same type (Figure 2). 


Case 2. The binary tree description converted from a 
two-level f-ary logical tree: To simplify the analysis all leaf 
nodes are assumed to be of the same type. 


For normal scan order and f < 10, a tree description can 
always be constructed such that the leaf node type only 
appears once in it. Under this condition, all the leaf nodes 
either appear consecutively in normal order in the binary 
tree, or there exists one subtree in which the leaf nodes 
appear consecutively in scan order and repeated use of the 
subtree covers all the leaf nodes (Figure 3). 


For f > 10, the number of occurrences of leaf nodes in the 
tree description increases with f (Figure 4). Case 1 gives a 
lower bound. A fairly tight upper bound is derived below. 


In seeking an upper bound for the number of occurrences 
of leaf nodes in the binary tree description of a two-level 
f-ary tree with all leaf nodes being identical, it is first ob- 
served that the worst case fanout must be odd. If f is 
even then the f-ary tree can always be described as two 
identical subtrees, which are represented only once using 
the subtree notation. If f is odd then there is one subtree 
instantiating an odd fanout, and one subtree representing 
an even fanout. They differ in fanout by one, and appear 
with the subtree representing the larger fanout to the 
left. The subtree notation can not be used to reduce the 
description at this level, but there are common subtrees 
at lower levels. 


The worst case occurs when the f-ary tree is described by 
one subtree of odd fanout at distance two from the root, 
and one subtree also of odd fanout at distance three from 
the root, and both these subtrees have a maximum 
number of occurrences of leaf nodes. Let 
$x_k = \max\{\text{leaf occurrences} \mid 2^k < f < 2^{k+1}\}$. 
Then it follows that $x_k = x_{k-2} + x_{k-3}$, with initial 
conditions $x_0 = x_1 = x_2 = 1$. There exist 
fanouts for which equality holds. Some of the worst-case 
fanouts $d_k$ can be derived by the recurrence equation 
$d_k = 2 d_{k-1} + (-1)^{k+1}$, $d_2 = 5$. An upper bound for the 
third-order recurrence equation $x_k = x_{k-2} + x_{k-3}$ is given 
by $2^{k/2}$. The length of NO is proportional to $x_k$, i.e., NO 
is $O(\sqrt{f})$ in the worst case. However, the number of 
occurrences of the leaf nodes grows much more slowly on 
average. 
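
The recurrences above can be checked numerically with a short Python sketch. The indexing is an assumption of this sketch: k denotes the interval 2^k < f < 2^(k+1), the leaf-occurrence recurrence is x_k = x_(k-2) + x_(k-3) with x_0 = x_1 = x_2 = 1, and the worst-case fanouts start from 5 at k = 2 with the sign chosen so that the sequence passes through the fanouts 85 and 171 that appear in Figure 10 and Table 1.

```python
def max_leaf_occurrences(k_max):
    """x_k = x_(k-2) + x_(k-3), with x_0 = x_1 = x_2 = 1."""
    x = [1, 1, 1]
    for k in range(3, k_max + 1):
        x.append(x[k - 2] + x[k - 3])
    return x[: k_max + 1]


def worst_case_fanouts(k_max):
    """d_k = 2*d_(k-1) + (-1)**(k+1), starting from d_2 = 5 (assumed base)."""
    d = {2: 5}
    for k in range(3, k_max + 1):
        d[k] = 2 * d[k - 1] + (-1) ** (k + 1)
    return d


if __name__ == "__main__":
    xs = max_leaf_occurrences(7)
    ds = worst_case_fanouts(7)
    for k in range(2, 8):
        # Each x_k stays below the bound 2**(k/2), i.e. below sqrt(f).
        print(k, ds[k], xs[k], round(2 ** (k / 2), 2))
```

Each generated fanout lies in its interval 2^k < d_k < 2^(k+1), and every x_k stays below 2^(k/2), which is the source of the square-root bound for normal order.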


In BRO the number of occurrences of a leaf node equals 1 
if f is a power of two, otherwise 2. These bounds are a 
consequence of padding nodes being inserted in 
bit-reversed order. Padding nodes are always encountered 
first. However, even though leaf nodes appear last, they 
form one group only when f is a power of two. The reason 
is that leaf nodes are attached pairwise to the padding 
nodes of the previous level. With bit-reversed scan order 
all left descendants of the previous level are treated 
before any right descendant is assigned a type. Empty 
nodes are always encountered, except when the last level 
of the binary tree is fully populated. Therefore, BRO is of 
fixed length: 6 words if leaf nodes appear once in the tree 
description, otherwise 10 words (Figures 3 and 4). 


Case 3. The binary tree description converted from a 
k-level f-ary tree: All nodes at one level of the f-ary tree 
are assumed identical (Figure 5). 


As the number of levels in the tree grows, leaf nodes are 
replaced by subtrees. This replacement is easily 
accommodated in the description using the subtree notation. If 
the number of occurrences of the leaf node type in a 
two-level subtree is x, and the length of the two-level 
subtree description is s, then adding one level to the tree 
increases the description by x*s, since each occurrence of 
a leaf node is replaced by s. Therefore, there are in total 
$x^2$ occurrences of the leaf node type in the resulting 
3-level tree. Repeating this substitution process, the 
sequence length for a k-level, f-ary tree is 
$s(1 + x + x^2 + x^3 + \dots + x^{k-2}) = s(x^{k-1} - 1)/(x - 1) = r$. 


For x=1, i.e., the leaf node type occurs only once in the 
description of a two-level f-ary tree, the sequence length 
is $s(k-1)$. It is proportional to the height of the tree. It 
is the minimum length possible. For x=2, the worst case 
for BRO, the sequence length is $s(2^{k-1} - 1)$. For x>2, s is 
proportional to x, which for NO is bounded by $\sqrt{f}$. The 
sequence length r is bounded by $\sqrt{N/f}$, where $N = f^k$. 
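
The sequence length can be evaluated with a small Python sketch. It assumes the geometric sum runs over k-1 terms, which is what makes the x=1 case equal to s(k-1) and the x=2 case equal to s(2^(k-1) - 1); the function name and the sample values of s and k are illustrative only.

```python
def sequence_length(s, x, k):
    """Length r of the description of a k-level tree, given the length s of
    the two-level description and x leaf occurrences per two-level subtree:
    r = s * (1 + x + ... + x**(k-2))."""
    if x == 1:
        return s * (k - 1)
    # The quotient is exact because (x - 1) divides x**(k-1) - 1.
    return s * (x ** (k - 1) - 1) // (x - 1)


if __name__ == "__main__":
    for x in (1, 2, 3):
        print(x, [sequence_length(4, x, k) for k in range(2, 7)])
```

With x=1 the length grows linearly in the number of levels, whereas for x>=2 it grows geometrically, which is the difference between the best and worst cases discussed above.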


In conclusion, the length of both scanning orders, as well 
as the processing time at the root processor, is at best 
$O(\log_2 N)$. The worst case for f-ary trees with all nodes at 
one level being identical is $O(\sqrt{N/f})$ for normal ordering, 
and $O(N^{1/\log_2 f})$ for bit-reversed ordering. In addition it 
should be noticed that there is always a minimum 
propagation time of $O(\log N)$ from the root to the leaves. 


II. Arbitrary fanout tree as input 


In this second case, the input consists of a description of 
an arbitrary fanout tree. The Tree Machine nodes insert 
padding nodes to create the necessary fanout. Identical 
nodes are represented only once if they appear consecu- 
tively in the scan order. However, for this algorithm we 
limit the grouping of nodes to one level of the arbitrary 
fanout tree. The tree description will contain at least one 
entry per logical tree level. 


An entry in the type file is of the form (NUM, ID, FANOUT), 
where NUM is the number of nodes of the same type and 
with fanout FANOUT that appear in succession in the 
scan order in one level of the input tree. Normal scan 
order is assumed. Subtrees can be defined. They are 
contained within '(' and ')' and can be nested. Figure 6 
shows an example of a tree description for Algorithm II. 


Input string: 1A3,3B3,9B3,27C0 


Figure 6. A 4-level, 3-ary logical tree 
The basic characteristics of this algorithm are: 


-- A node in the Tree Machine defines, as required, one or 
both of its descendants to become padding nodes. 
Generated padding nodes may have a fanout greater than 
two. Every node accepts a description of an arbitrary 
fanout tree as input. The padding node is no different 
from other nodes. It generates its own padding nodes. 
Every node uses the first entry it receives to determine 
its own type. 


-- A Tree Machine node has to monitor the structure of the 
logical tree in order to dispatch the input to the proper 
descendants. The nodes in one level of the input tree 
may have different fanouts. The total number of nodes 
at one level of the input tree equals the sum of the 
fanouts of all the nodes at the preceding level. A Tree 
Machine node needs to keep and update information on 
one level of the input tree, including the total number 
of nodes in the input tree at the level which the current 
input entry is filling, the accumulated total fanouts of 
those nodes, the number of nodes at the current level 
of the input tree that shall be placed in the left and 
right subtrees of a Tree Machine node, and how many of 
those still remain to be placed. 


-- In the process of dispatching one input entry, two 
multiplications are required to calculate the total fanouts 
of the nodes going to the left and to the right subtrees 
of a processor. The multiplication time is proportional 
to $\log_2 f$. 
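
The rule that the node count of one level equals the summed fanouts of the preceding level can be checked on the input string of Figure 6 with the Python sketch below. It is an illustration under assumptions: each entry is written as NUM, a single-letter ID and FANOUT without separators, exactly one entry is given per level, and subtree delimiters are not supported. The function names are hypothetical.

```python
import re


def parse_description(text):
    """Parse entries such as '27C0' into (NUM, ID, FANOUT) triples."""
    entries = []
    for item in text.split(","):
        match = re.fullmatch(r"(\d+)([A-Za-z])(\d+)", item.strip())
        if match is None:
            raise ValueError(f"malformed entry: {item!r}")
        num, node_type, fanout = match.groups()
        entries.append((int(num), node_type, int(fanout)))
    return entries


def is_consistent(entries):
    """Check that each level holds NUM * FANOUT nodes of the level above
    and that the last level consists of leaves (fanout 0).
    Assumes exactly one entry per level."""
    if not entries:
        return False
    for (num, _, fanout), (next_num, _, _) in zip(entries, entries[1:]):
        if num * fanout != next_num:
            return False
    return entries[-1][2] == 0


if __name__ == "__main__":
    entries = parse_description("1A3,3B3,9B3,27C0")
    print(entries, is_consistent(entries))
```

A full loader would also have to split each level between the left and right subtrees, which is where the two multiplications mentioned above arise; that part is not modelled here.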


The length of the input for this algorithm is at best 
$O(\log_2 N)$ for a regular logical tree of fanout f. At worst it 
is $O(N)$. The total loading time is at best $O(\log_2 N)$. 


Summary and Conclusions 


Table 1 contains the loading times in instruction cycles 
for Algorithm I.a, binary tree input using normal scan 
order; Algorithm I.b, bit-reversed order; and Algorithm II, 
arbitrary fanout tree input. The first figure for the exe- 
cution time is the time to finish processing the input at 
the root. The second figure is the total time elapsed until 
all leaves are assigned a type. The difference between 
these two figures corresponds to the propagation time of 
the last input entry. The program code is loaded immedi- 
ately after downloading the tree description. The propa- 
gation delay can be reduced by pipelining the code load- 
ing phase with the node typing phase. 


Algorithm        I.a                 I.b                 II 
Loader size      123 words           72 words            98 words 

         input    time         input    time         input    time 
         words    inst. cyc.   words    inst. cyc.   words    inst. cyc. 

Ex 1       4      75/141         4      64/122        12      137/281 
Ex 2      18      255/471       13      228/488       27      413/621 
Ex 3      27      269/433      164      1426/1661     15      186/378 
Ex 4       6      123/339        6      64/252         6      51/235 
Ex 5      69      627/836       18      116/386        6      51/235 


Ex 1. 4-level binary tree with two different node types 

Ex 2. 9-level binary tree, one node type per level 

Ex 3. 5-level 3-ary tree, one node type per level 

Ex 4. 2-level $2^8$-ary tree, best fanout for I.a and I.b 

Ex 5. 2-level 171-ary tree (worst-case fanout for 
Algorithm I.a, $2^7 < f < 2^8$) 


Table 1. Comparison of Algorithms I.a, I.b, and II for some 
simple trees 


The ratios of program sizes of Algorithms I.a, I.b, and II are 
1.7:1:1.3. The smaller program size for Algorithm II 
compared to I.a is a consequence of the decision to require at 
least one entry per level in the logical tree in Algorithm 
II. This constraint reduces the program complexity 
significantly. The program of Algorithm I.b is the 
shortest among the three because the bit-reversed ordering 
simplifies the logic of the algorithm. 


Algorithm II is the slowest of the three algorithms for 
binary trees (Examples 1 and 2). Its processing at the 
root requires about 50% more time than Algorithm I.b 
for binary trees with all nodes at a level being equal. The 
processing time increases linearly with the height of the 
binary tree for all three algorithms. 


Examples 2, 3, 4, and 5 are all mapped onto a 9-level 
binary tree. The total number of nodes in each logical 
tree is 511, 121, 257 and 172 respectively. The total 
number of used nodes in the binary tree, including the 
padding nodes, is 511, 161, 511 and 341 respectively. 
Because of the differences in the logical tree structures, the 
execution time differs significantly between the 
algorithms. Notice that in Example 3 Algorithm I.b, and in 
Example 5 Algorithm I.a, performs much worse than the other 
two algorithms. In contrast, Algorithm II always yields a 
satisfactory performance for a large fanout. 
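
The node counts quoted for Examples 2 to 5 can be reproduced with the Python sketch below. It rests on an inference rather than on a formula stated in the paper: the quoted counts are consistent with every non-leaf node of fanout f requiring f - 2 padding nodes, since a binary subtree with f leaves has f - 1 internal nodes, one of which is the logical node itself. The function names are hypothetical.

```python
def logical_nodes(fanout, levels):
    """Number of nodes in a complete tree with the given fanout and levels."""
    return sum(fanout ** level for level in range(levels))


def binary_nodes(fanout, levels):
    """Nodes used in the binary tree, assuming fanout - 2 padding nodes
    for every non-leaf logical node (valid for fanout >= 2)."""
    internal = logical_nodes(fanout, levels - 1)
    return logical_nodes(fanout, levels) + internal * (fanout - 2)


if __name__ == "__main__":
    # Examples 2 to 5 of Table 1 as (fanout, levels).
    for fanout, levels in [(2, 9), (3, 5), (2 ** 8, 2), (171, 2)]:
        print(fanout, levels,
              logical_nodes(fanout, levels), binary_nodes(fanout, levels))
```

Under this assumption the sketch gives 511, 121, 257 and 172 logical nodes and 511, 161, 511 and 341 binary tree nodes, matching the figures in the text.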


The length of the input is independent of the fanout for 
Algorithm II, whereas the length of the input for 
Algorithms I.a and I.b depends on the fanout (Figures 7 and 10). 


For fanouts equal to a power of two the corresponding 
binary tree is always balanced. The input length for 
Algorithm I.b, as well as the total execution time, is less than 
that of Algorithm II, except for a two-level tree (Figures 7 
and 8). Algorithm II is more efficient than Algorithm I.a 
for fanouts equal to a power of two. 


Figure 7. Instruction cycles for the root (1) and for 
completing the loading phase (2) for a 2-level $2^n$-ary tree. 
Successive levels having different node types. 


Figure 8. Instruction cycles for the root (1) and for 
completing the loading phase (2) for a 3-level $2^n$-ary tree. 
Successive levels having different node types. 


Figure 9. Instruction cycles for the root (1) and for 
completing the loading phase (2) for an n-level 3-ary tree. 
Successive levels having different node types. 


However, Algorithm II performs better for 
most fanouts. Algorithm I.b grows exponentially 
with the height of the tree (Figure 9), except for fanouts 
equal to a power of two (Figure 8). The exponential growth 
rate is independent of the fanout. The growth rate for 
Algorithm I.a varies from linear to exponential depending 
upon the fanout. It may be better or worse than 
Algorithm I.b (Figures 9 and 10). The growth rate of Algorithm 
I.a as a function of the worst-case fanout in each interval 
$2^n < f < 2^{n+1}$ follows a square root function (Figure 10). 


Figure 10. Instruction cycles for the root (1) and for 
completing the loading phase (2) for a 2-level f-ary tree. 
Successive levels having different node types. 


The asymptotic loading times are summarized in Table 2. 
These time bounds are also bounds for the length of the 
input, except for Algorithms I.a and I.b, for which the in- 
put can be reduced to O(1) for a binary tree with all 
nodes identical. 


Algorithm     I.a                I.b                    II 
Best case     $O(\log_2 N)$      $O(\log_2 N)$          $O(\log_2 N)$ 
Worst case    $O(\sqrt{N/f})$    $O(N^{1/\log_2 f})$    $O(\log_2 N)$ 


Table 2. Estimated loading times for k-level, f-ary trees 
of N nodes. All nodes of a logical level are of the same type. 


In conclusion, for a non-binary logical tree, Algorithm II 
has the best asymptotic properties, $O(\log_2 N)$, is also 
efficient for small trees, and has compact, simple code. 
Its performance for binary trees is close to that of the 
other two algorithms. 
Abstract 


This paper discusses efficient routing algorithms for 
multicomputer networks organized as reconfigurable binary 
trees. The communication techniques introduced are optimal from 
both viewpoints: the total bit size of the routing information 
code, BS(RI), that routes the message among the various transit 
nodes of the communication path, and the total time of 
communication. It is shown that 
$BS(RI) < \log_2 K + 2 \log_2 \log_2 K$, 
where K is the total number of nodes in a reconfigurable 
binary tree. Further, since each transit node N selects 
either its right or left successor in the communication path via a 
simple logical operation that takes the time of one gate delay, 
the total time, CD, of the $N_s \to N_d$ communication also 
approaches the theoretical minimum, where $N_s$ is a source node 
and $N_d$ is a destination node. 


1. Introduction 


The binary tree is a very popular implementation for a 
distributed computing system in which each tree node is 
implemented as a separate and autonomous computational node [1,2,3]. 
The reason for this is that for conventional computations, the 
tree structure describes a variety of control and 
computational algorithms. 


Especially important is the use of binary trees in 
distributed data bases, since most of the file access 
algorithms for hierarchical data bases are based on queries 
organized as a binary tree [4]. Trees have two types of nodes: 
leaves and non-leaves, where a leaf is understood as a node of 
the lowest level, i=0, and a non-leaf node has level $i \ge 1$, 
where $i \le n$ for the n-level tree. In this tree the node of the 
highest (n) level is called the root. 


A multicomputer network that reconfigures into trees as 
well as other useful structures (stars and rings) can be 
organized if its nodes, identified with computer elements, CE, are 
interconnected with the memory-processor bus (or DC-bus) 
described in [5,6]. To organize a data broadcast between a pair 
of nodes N and N* interconnected with the DC-bus, it is 
sufficient for network node N to generate the position code, or 
address, of N*. 


Activation of the data path between N and N* will be denoted 
as the transition N -> N*, meaning that N will generate the 
position code or address of N*, and N will establish a data path 
between N and N*. 


It is assumed that during reconfiguration, the maintained 
direction of succession is from leaves to the root, whereby 
each node N generates only one position code N* of its 
immediate successor in the binary tree. This organization will 
minimize the total time of reconfiguration. 


During regular node-to-node communications, both 
transitions N -> N* and N* -> N will be maintained. 
The following procedure will be used to generate various 
tree structures that can be assumed by a distributed computing 
system. (This procedure is a particular case of a more general 
procedure described in [6,7] that can generate not only trees, 
but also rings and stars.) 


Fig. 1 Generation of a reconfigurable 
binary tree with the use of the SRVB-register 


Assume that each tree node N is provided with a special 
shift-register of length n which stores its position code N, 
where n is the size of the code (Fig. 1). Then for each type 
of communication between N and N* (PE-PE*, PE-ME*, ME-ME*), 
node N uses the following reconfiguration equation to generate 
the position code N* of its single successor in the binary 
tree: 


$N^* = 1[N]_0 + B$ (1) 


where $1[N]_0$ is a one-bit non-circular shift of N to the left and 
B is an n-bit reconfiguration constant called the bias, brought 
with the reconfiguration instruction to all network nodes that 
are requested for reconfiguration. The shift-register of Fig. 
1 is called a shift-register with the variable bias (SRVB). 
Hereafter, subscript 0 will be omitted in this paper because 
it is always 0 for reconfigurable binary trees. For other 
network structures, the SRVB may perform a circular shift 
whereby the gate that follows the least significant bit is fed 
with signal 1 [6]. 


In Fig. 2, the network of thirty-two nodes receiving bias 
B=11010 reconfigures into the 5-level tree with the root R=10110. 
Since there are $2^n$ different biases, it is possible to 
generate $2^n$ different trees with these techniques. 
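
The reconfiguration equation can be exercised with a minimal Python sketch. It assumes that "+" denotes bitwise mod 2 addition (exclusive OR), as the worked bit patterns later in the paper suggest, and that the non-circular left shift drops the most significant bit and shifts in a 0. The function names are hypothetical.

```python
def successor(node, bias, n):
    """N* = 1[N] + B: one-bit non-circular left shift of an n-bit code,
    followed by mod 2 addition (XOR) of the bias."""
    mask = (1 << n) - 1
    return ((node << 1) & mask) ^ bias


def path_to_root(node, bias, n):
    """Follow successors until the fixed point (the root) is reached."""
    path = [node]
    while True:
        nxt = successor(path[-1], bias, n)
        if nxt == path[-1]:
            return path
        path.append(nxt)


if __name__ == "__main__":
    n, bias = 5, 0b11010
    for node in (0b00000, 0b10111, 0b10110):
        print([format(p, "05b") for p in path_to_root(node, bias, n)])
```

With the bias B=11010 of Fig. 2, every one of the thirty-two position codes reaches the code 10110, which is its own successor and therefore acts as the root.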


In any reconfigurable binary tree, organization of 
efficient communication among tree nodes presents a major problem 
because position codes of tree nodes in the hierarchy of tree 
levels may change. Therefore, routing a communication 
message from a source node, $N_s$, to a destination node, $N_d$, 
requires storing the position codes of all transit nodes of the 
$N_s \to N_d$ communication path. This routing information RI can 
be either stored in a communication message CM or distributed 
among transit nodes so that each transit node stores the 
position code of its successor in the $N_s \to N_d$ communication path. 


Fig. 2 Reconfigurable binary tree having 
bias B=11010 and root R=10110. 


Both these alternatives are infeasible since they lead to 
a large bit size of the RI-code and a long time of internode 
communication associated with the large size of RI. 


This paper is dedicated to the solution of the communication 
problem in a reconfigurable binary tree, allowing a dramatic 
reduction in both the total bit size of the RI code and the 
overall time of communication. The presented routing techniques 
allow one to obtain the following upper bound on the bit size of RI: 


$BS(RI) < \log_2 K + 2 \log_2 \log_2 K$ (3) 


Also, each transit node delays a CM-message by not more 
than $3t_d$. Therefore, the maximal communication delay CD for 
the longest $N_s \to N_d$ communication path, which includes 
$2 \log_2 K - 1$ tree nodes, is upper bounded as follows: 


$CD < 3t_d (2 \log_2 K - 1)$ (4) 

Therefore, introduced communication algorithms are opti- 
mal from both viewpoints--bit size of the routing information 
and total communication delay. 


The relationship of this paper with other work in the 
area is as follows: 


First, by introducing original routing algorithms for 
distributed reconfigurable computing systems organized as bin- 
ary trees, it is connected with other works on reorganizations 
and reconfigurations discussed in [8-13]. 


Second, by using the shift-register theory to develop re- 
configurations of binary trees, it is connected with the lit- 
erature on shift-register sequences [14-19]. 


Third, by discussing distributed computing systems 
organized as binary trees, it is associated with [1-4]. 


To form efficient routing algorithms, it is necessary to 
develop a simple synthesis procedure aimed at finding a tree 
node of level n-i, $i \le n$. If i=0, the node found is the root, 
R, of level n; if $i \ge 1$, it is any other node of the tree; if 
i=n, it is the node of level 0, i.e., a leaf. To find the 
root, R, we define the bias structure for a non-circular SRVB 
as follows: Given bias $B = GP_1 + \dots + GP_t$, where $GP_j$ 
($j = 1, \dots, t$) is one of its ones positions, otherwise called 
a generating position. Let $a_1, a_2, \dots, a_t$ be bias distances 
defined as follows: $GP_{i+1} = a_i[GP_i]$, i.e., $a_i$ shows the 
number of left-hand shifts between $GP_i$ and $GP_{i+1}$, where i 
changes from 1 to t-1. For $GP_t$, $a_t[GP_t] = 0$, since we are 
having a non-circular shift (Fig. 3a). 


Fig. 3 
(a) Bias structure. 
(b) Formation of BL-sums and the root. 
(c) Recursive sums. 
For each generating position, $GP_i$, let us form a mod 2 
sum of all left j-bit shifts of $GP_i$ ranging from j = 0 to 
j = $a_i - 1$, where $a_i$ is the bias distance such that 
$a_i[GP_i] = GP_{i+1}$. Denote this sum as 
$BL(GP_i) = GP_i + 1[GP_i] + \dots + (a_i - 1)[GP_i]$. 
By construction, $GP_i$ is an addend of $BL(GP_i)$ and $GP_{i+1}$ is 
not an addend of $BL(GP_i)$ (Fig. 3b). Form BL sums for odd 
generating positions of the bias as $BL(GP_1)$, $BL(GP_3)$, ..., 
$BL(GP_k)$, where k=t if t is odd and k=t-1 if t is even. Then 
the root R is: 


$R = BL(GP_1) + BL(GP_3) + \dots + BL(GP_k)$ (4-1) 

The significance of this formula is illustrated by Fig. 3b. 
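
The root construction can be sketched in Python as follows. The sketch makes several assumptions that are not spelled out in the text: generating positions are numbered starting from the least significant one of the bias, "+" is bitwise mod 2 addition, and the BL sum of the last generating position runs up to the most significant bit of the n-bit register. The function name is hypothetical.

```python
def root_from_bias(bias, n):
    """Form R as the mod 2 sum of the BL sums of the odd-numbered
    generating positions GP_1, GP_3, ... of an n-bit bias."""
    positions = [p for p in range(n) if (bias >> p) & 1]  # GP_1 first
    root = 0
    for j in range(0, len(positions), 2):        # GP_1, GP_3, GP_5, ...
        start = positions[j]
        # BL(GP) covers the bits from GP up to, but excluding, the next GP;
        # for the last GP it covers everything up to the top bit.
        stop = positions[j + 1] if j + 1 < len(positions) else n
        for p in range(start, stop):
            root ^= 1 << p
    return root


if __name__ == "__main__":
    for bias, n in [(0b11010, 5), (0b1011011, 7)]:
        print(format(bias, f"0{n}b"), format(root_from_bias(bias, n), f"0{n}b"))
```

For B=11010 the sketch returns 10110, and for B=1011011 it returns 1001001, which agree with the roots quoted for those biases. In both cases shifting the result left by one bit and adding the bias mod 2 gives the same code again, so the result is a fixed point of the reconfiguration equation.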


As we have seen, the technique for constructing the root 
is very simple. It can easily be performed by the programmer, 
who can find the root for the given bias B and then store it 
in the instruction that reconfigures the network into the 
given single binary tree, or the root R can be formed inside a tree 
node via a dedicated logical circuit that includes two 
shift-registers interconnected with the mod 2 counter. 


Once a tree node N finds the root, R, it can form the 
position code of any other node N' of level n-i, using a 
so-called recursive sum code $RS_i(k)$ defined as follows via a 
simple inductive procedure: 


Basis: $RS_0(0) = 0$, i = 0, k = 0 


Inductive Step: $RS_i(k) = X_i + RS_m(k-1)$, where $k \le i$, 
$X_i = 2^{n-i}$, and m < i. 


Therefore, if the SRVB stores $RS_i(k)$, its position $X_i$ is 
always 1, and (k-1) is the total number of other more 
significant ones positions (Fig. 3c). 


For instance, $RS_3(3) = X_1 + X_2 + X_3$ can be represented 
as follows via this definition: $RS_3(3) = X_3 + RS_2(2)$; 
$RS_2(2) = X_2 + RS_1(1)$; $RS_1(1) = X_1$. 


For a recursive sum, $RS_i(k) = X_i + RS_m(k-1)$, $X_i$ will be 
called a level variable. 


As shown below, the level variable $X_i = 1$ uniquely 
specifies a tree node N(n-i) of level n-i as follows: 
$N(n-i) = R + RS_i(k)$, where $k \le i$ and R is the root. (4-2) 


Level 7: R = 1001001 
Level 6: N(6) = 0001001 
Level 5: N1(5) = 1101001, N2(5) = 0101001 
Level 4: N1(4) = 1011001, N2(4) = 1111001, N3(4) = 0011001, 
         N4(4) = 0111001 


Fig. 4 Construction of tree nodes 
of arbitrary levels. 


Example. Given bias B = 1011011. Let us construct all
the nodes of levels n-i = 7, 6, 5, and 4 (Fig. 4). Here, n = 7.
Thus the root is of level 7: $R = BL(GP_1) + BL(GP_3) + BL(GP_5)$,
where $BL(GP_1) = GP_1$, $BL(GP_3) = GP_3$, and $BL(GP_5) = GP_5$;
thus, R = 1001001. Indeed, the successor N* of R is
N* = 1[R] + B = 1[1001001] + 1011011 =
0010010 + 1011011 = 1001001 = R = $X_1 + X_4 + X_7$.

The node of level 6 is: $N(n-1) = R + RS_1(1) = R + X_1$ =
0001001 = $X_4 + X_7$. This node is succeeded by the root because
N* = 1[N(6)] + B = 0010010 + 1011011 = 1001001 = R. There are
two nodes of level n-2 = 5: $N_1(5) = R + RS_2(1) = R + X_2$ =
1101001 = 105 and $N_2(5) = R + RS_2(2) = R + X_1 + X_2 = X_2 + X_4 + X_7$
= 0101001. Indeed, $N_1(5)$ is succeeded by N* = 1[$N_1(5)$] + B =
1010010 + 1011011 = 0001001 = N(6) = 9. Likewise, $N_2(5)$ is
succeeded by the same N* = 1[$N_2(5)$] + B = 1010010 + 1011011 =
0001001 = N(6), etc.

Similarly, one can construct all the other nodes of this
tree.
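The construction in this example can be checked with a short sketch. It assumes, as the example's figures imply, that $X_j = 2^{n-j}$ and that a node of level n-i is $R + RS_i$, where $RS_i$ contains $X_i$ plus any subset of the more significant variables $X_1, \dots, X_{i-1}$. The helper names are illustrative only.

```python
def successor(node: int, bias: int, n: int) -> int:
    """N* = 1[N] + B: non-circular left shift, then mod-2 addition of the bias."""
    mask = (1 << n) - 1
    return ((node << 1) & mask) ^ bias


def level_nodes(root: int, n: int, i: int) -> list[int]:
    """All position codes of level n-i, i.e. R + RS_i(k) for every admissible RS_i."""
    if i == 0:
        return [root]
    x_i = 1 << (n - i)                         # level variable X_i = 2**(n-i)
    nodes = []
    for subset in range(1 << (i - 1)):         # choices for X_1 .. X_(i-1)
        rs = x_i | (subset << (n - i + 1))     # those variables sit above X_i
        nodes.append(root ^ rs)
    return nodes


B, R, n = 0b1011011, 0b1001001, 7
for i in range(4):
    print("level", n - i, [format(v, "07b") for v in level_nodes(R, n, i)])
```

For every node produced at level n-i with i > 0, `successor` returns a node of level n-i+1, which is the relation the example verifies by hand.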
3. Generation Inside a Transit Node, N, of its Immediate
Neighbors in a Reconfigurable Binary Tree


During reconfiguration into a new tree, each transit node
N is the source of the $N \to N^*$ transition and the destination
of the $LN \to N$ and $RN \to N$ transitions (Fig. 5a). Since LN and
RN are on the same level, they are specified with the same
level variable $X_i$, i.e., $LN = R + RS_i$ and $RN = R + RS_i'$. Assume
that the left node LN is always greater than the right node
RN, i.e., LN > RN. To order LN and RN, we have to order recursive
sums. Assume that they are ordered as follows:


[Fig. 5 (a) Transit node N and its closest neighbors in the
reconfigurable binary tree (root R = 01001, bias B = 11011).
(b) Generation inside node N of the position codes for its
immediate neighbors in the reconfigurable binary tree having
bias B = 11011 and root R = 01001: LN = 00110, RN = 10110.]


P1. $RS_i > RS_j$ if $X_i > X_j$, where $X_i = 2^{n-i}$, $X_j = 2^{n-j}$ and

Application of P1 and P2 rules gives us $LN = R + RS_i$ and
$RN = R + RS_i + X_1$, i.e., $LN + RN = X_1$. Furthermore, since the
root, $R = R + RS_0$, and $RS_0$ is greater than any other recursive
sum inasmuch as it is specified by the level variable $X_0 = 2^n$,
we obtain that the left node LN necessarily has
the same meaning of the $X_1$ variable as $X_1$ of the root R. On the
other hand, the right node RN always has its $X_1$ variable reversed,
i.e., opposite the one of the root.


Therefore, we can use the following technique for generating
LN and RN codes inside node N: Using reconfiguration
equation (1), we find that 1[LN] = N + B. Thus,

$$LN = X_1(R) + 1^{-1}[N + B] \qquad (5)$$

where $1^{-1}[\dots]$ is non-circular shift to the right and $X_1(R)$
is the meaning of the most significant variable in the node LN.

Similarly, since 1[RN] = N + B,

$$RN = \overline{X_1(R)} + 1^{-1}[N + B] \qquad (6)$$
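Equations (5) and (6) can be sketched as follows. The sketch assumes n-bit position codes, mod-2 addition for "+", and that the most significant bit of LN is copied from the root while that of RN is its complement. The function name `predecessors` is illustrative, and the node N = 10111 used in the check is not quoted from the paper; it is derived here as the successor of LN = 00110 for R = 01001 and B = 11011.

```python
def predecessors(node: int, root: int, bias: int, n: int) -> tuple[int, int]:
    """Return (LN, RN) for transit node N, following Eqs. (5) and (6)."""
    if (node ^ bias) & 1:
        # 1[LN] always has LSB = 0, so N + B must also end in 0.
        raise ValueError("node has no predecessors in this tree")
    msb = 1 << (n - 1)
    shifted = (node ^ bias) >> 1        # 1^{-1}[N + B], non-circular right shift
    left = (root & msb) | shifted       # X_1 taken from the root (Eq. 5)
    right = left ^ msb                  # X_1 complemented (Eq. 6)
    return left, right


R, B, n = 0b01001, 0b11011, 5
N = 0b10111                             # assumed: successor of LN = 00110
LN, RN = predecessors(N, R, B, n)
print(format(LN, "05b"), format(RN, "05b"))  # 00110 10110
```

Both results map back to N under the reconfiguration equation N* = 1[N] + B, and they differ only in the most significant bit, in agreement with $LN + RN = X_1$.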


Thus, LN and RN are obtained inside node N via the right-shifted
non-circular SRVB receiving bias B, whereas N* is
obtained conventionally inside node N using the left-shifted
non-circular SRVB receiving the same bias B (Fig. 5b).

4. Communication Circuits Inside Transit Node

To perform fast internode communications, each tree node
must be provided with the communication circuits that allow
its effective functioning as a transit node, i.e., the one
which is provided with three immediate neighbors LN, RN and N*
(Fig. 5) or which merely passes communication messages to one
of its neighbors LN, RN or N*. Since messages may flow in the
two directions, TOR and TOL, where TOR means to the root and
TOL means to the leaves, each transit node has T and D terminals
respectively designated for TOR broadcast if message
passes from T to D, or for TOL broadcast otherwise.

To maintain concurrent communications whereby a message
received by the D-terminal is allowed to flow concurrently to
the left and right neighbors of the given node N, each transit
node is provided with the two T-terminals, LT and RT, that connect
given node N with LN node and RN node, respectively (Fig. 6).

[Fig. 6 Communication circuits inside each tree node.]

Each connection of the transit node N with its successor
N* and the two predecessors LN and RN in the tree is performed
via a pair of connecting elements, MSE, selected during
reconfiguration.

4.1. Conflict Resolution Among Concurrent Messages

Since there are three independent terminals in each transit
node, it may receive up to three concurrent messages at a
time: $CM_L$, $CM_R$ and $CM_D$. Since it may pass in transit only
one message at a time, the node performs the conflict resolution
via special logic called conflict decoder, CD (Fig. 6).
Conflict decoder is a logical circuit that has a separate
output for each conflict situation.

The decisions made by the conflict decoder are based on
the following rules:

1. Of several messages that must use node N in transit,
only one message is allowed to pass at each clock period. The
remaining messages are saved. Since a transit node has three
terminals (LT, RT, D), each terminal is provided with the individual
push-down stack of save registers used to store saved
messages, SM.

2. If a saved message, SM, is concurrent with a current
message, CM, a decision to pass is given to a SM message and a
CM message is saved.

3. If a message received via D-terminal (saved, $SM_D$, or
current, $CM_D$), is concurrent with left or right messages, a
decision to pass is in favor of a message received via D-terminal.
The remaining messages are saved.

4. Of the two concurrent left and right messages, the
decision to pass favors left message at even clock periods and
right message at odd clock periods.

Example: Suppose that a transit node N has six concurrent
messages competing for transit: $CM_L$, $CM_R$, $CM_D$, $SM_L^1$, $SM_R^1$
and $SM_D^1$, where upper subscript for saved messages shows their
addresses in the register stacks. We assume that each stack
has only one saved message and no current message arrives until
all messages are passed. At moment $T_0$, conflict decoder
allows $SM_D$ to pass; $CM_D$ is saved as new $SM_D$; $CM_L$ is saved as
$SM_L$; $CM_R$ is saved as $SM_R$; at moment $T_1$, $SM_D$ is allowed to
pass; down stack has no saved messages; left and right stacks
store two messages each (Fig. 7). At even moment $T_2$, $SM_L$
passes; left stack retains one saved message; right stack retains
two messages; at odd moment $T_3$, $SM_R$ passes; left and
right stacks retain one message each, etc. The entire message
transit ends at $T_5$.

[Fig. 7 Conflict resolution among communication messages.
Columns: time, passed message, left stack, right stack, down stack,
current messages.]

4.2. Dynamic Activation of Transfer Modes
in Connecting Elements

As was indicated above, each transit node N is connected
with its predecessor (LN or RN) or successor N* via a dedicated
pair of connecting elements selected during reconfiguration
(Fig. 6). For the $N \to N^*$ broadcast, the ($MSE_N$-$MSE_{N^*}$)-pair
belongs to N; for the $N \to LN$ broadcast, the ($MSE_N$-$MSE_{LN}$)-pair belongs
to LN; for the $N \to RN$ broadcast, the ($MSE_N$-$MSE_{RN}$)-pair belongs to RN
(Fig. 1). The mode of transfer of each pair of connecting elements
depends on the direction of broadcast. For the TOR communication
$N \to N^*$, from N to its successor, N*, connecting
element $MSE_N$ should be activated in the write direction
($w(MSE_N)$) and $MSE_{N^*}$ should be activated in the read direction
($r(MSE_{N^*})$). For TOL communication $N \to LN$, the directions
of transfer in $MSE_N$ and $MSE_{LN}$ reverse to read for $MSE_N$ and
write for $MSE_{LN}$ (Fig. 6). Similar directions are true for the
$N \to RN$ broadcast.

Since at each clock period, a transit node N may change
the direction of broadcast, we assume that the modes of transfer
in connecting elements will be activated dynamically by
the messages that are allowed to pass through a given output
terminal.

For the TOR broadcast via D-terminal, such activation
will be performed by the $L_1$ logic which receives several
static inputs and one dynamic input from the allowed message
(saved or current). Static inputs are: code N for selecting
$MSE_N$; code N* for selecting $MSE_{N^*}$; r-signal for $r(MSE_{N^*})$ and
w-signal for $w(MSE_N)$. The $L_1$ logic is activated concurrently
during TOR message transit via N; thus, it introduces no additional
delay into a message transit. For the TOL left broadcast
via LT-terminal, this dynamic activation is performed by
the $L_2$ logic which issues $r(MSE_N)$ and $w(MSE_{LN})$.

Similarly, for the TOL right broadcast via RT-terminal,
$L_3$ logic performs dynamic activation of two connecting elements
$MSE_N$ and $MSE_{RN}$, as $r(MSE_N)$ and $w(MSE_{RN})$.

Since the same pair of connecting elements $MSE_N$ and $MSE_{N^*}$
that connect N and N* may be activated by N and N* in the opposite
directions, $MSE_N$ and $MSE_{N^*}$ are provided with a simple
logic to resolve these conflicts. The organization of this
logic is similar to the one discussed above for resolving conflicts
among concurrent messages. The only difference is that
it is much simpler, since there is only one type of conflict
that may arise. Thus, the conflict logic will be reduced to a
single 2-input gate.
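The four conflict-resolution rules of Section 4.1 can be exercised with a small software model. This is only a sketch of the decision order, not of the conflict decoder hardware. It assumes one message passes per clock, that clocks are numbered from 0, and that the top of a push-down stack leaves first; which saved message of a stack goes first is not spelled out in the text. The function name and the terminal labels 'L', 'R', 'D' are illustrative.

```python
def run_conflict_decoder(current, stacks):
    """Drain the messages competing for one transit node, one per clock.

    current: terminal -> message arriving at T0 (or None)
    stacks:  terminal -> list of saved messages (last element = top of stack)
    Both use the terminals 'L' (LT), 'R' (RT) and 'D'.
    Returns a list of (clock, passed message, stack sizes after the clock).
    """
    stacks = {t: list(stacks.get(t, [])) for t in "LRD"}
    pending = dict(current)
    trace, clock = [], 0

    def waiting(t):
        return bool(stacks[t]) or pending.get(t) is not None

    while any(waiting(t) for t in "LRD"):
        if waiting("D"):                          # rule 3: D-terminal wins
            term = "D"
        else:                                     # rule 4: L on even, R on odd clocks
            order = "LR" if clock % 2 == 0 else "RL"
            term = next(t for t in order if waiting(t))
        if stacks[term]:                          # rule 2: saved before current
            passed = stacks[term].pop()
        else:
            passed, pending[term] = pending[term], None
        for t in "LRD":                           # rule 1: everything else is saved
            if pending.get(t) is not None:
                stacks[t].append(pending[t])
                pending[t] = None
        trace.append((clock, passed, {t: len(stacks[t]) for t in "LRD"}))
        clock += 1
    return trace


trace = run_conflict_decoder(
    current={"L": "CM_L", "R": "CM_R", "D": "CM_D"},
    stacks={"L": ["SM_L"], "R": ["SM_R"], "D": ["SM_D"]},
)
for step in trace:
    print(step)
```

With the six messages of the example, the model passes the two D-terminal messages first, then alternates left and right, and finishes at clock 5.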


5. Individual Communications

Below, we will introduce communication algorithms which
allow efficient internode communications. Two types of communication
will be considered:

1. Node-Root communication, and

2. Node-Node communication.

5.1. Node-Root Communications

There are two types of node-root communication: (a) node
→ root and (b) root → node.

5.1.1. Node to Root Communication, $N_s \to R$. For the $N_s \to R$
communication, $N_s$ is the source, R is the destination;
$N_s$ generates communication message CM that stores the position
code of the root, R, defined in Eq. (4-1). In passing this
message, each transit node, N, compares R with its own position
code, N. If N = R, the message reaches the destination.
If N ≠ R, it is passed top down from one of the top terminals
(LT or RT) to the only D-terminal.


5.1.2. Root-Node Communication, $R \to N_d$. For this case,
the message is generated by the root, i.e., R is the source,
$N_d$ is the destination. Message, CM, stores a so-called address
code of $N_d$ defined as follows:

By the address code, $AC_d$, of the node $N_d$, we define such a
binary word which contains all the routing information that
allows a communication message to reach node $N_d$ from the root
via the minimal communication path (i.e., the one which traverses
each node of the tree only once).

Below, we define a procedure which allows one to obtain
the address code, $AC_d$, of node $N_d$ via simple logical operations
performed over its recursive sum, $RS_i$, introduced in
Eq. (4-2).

As was specified by Eq. (4-2), node $N_d$ of level
(n-i) is defined as follows:

$N_d = R + RS_i$, where $RS_i$ is the recursive sum.

The immediate more significant successor of this node in
the tree, otherwise called reconfiguration 1-successor, can be
found via the reconfiguration equation (Equation 1) as:

$$N^* = 1[N_d] + B = 1[R] + 1[RS_i] + B.$$

Since for the root, R,

$$R = 1[R] + B,$$
$$N^* = 1[R] + 1[RS_i] + B = R + 1[RS_i] = R + RS_{i-1}.$$

Therefore, N* (the reconfiguration 1-successor of $N_d$) is
the node of the next more significant level, n-(i-1), defined
via recursive sum $RS_{i-1}$ which is a one-bit shift to the left
of the original recursive sum $RS_i$ that specifies node $N_d$.
Similarly, it is easy to show that the reconfiguration 2-successor
of node $N_d$ in the communication path with the root is
specified by recursive sum $RS_{i-2}$ which is a 2-bit shift of $RS_i$:
$RS_{i-2} = 2[RS_i]$, where $RS_i$ specifies node $N_d$.

Since by definition, level variable $X_i = 1$ specifies the
least significant position of $RS_i$ of the node $N_d$, it will be
the last one which will be shifted out during consecutive
shifts of $RS_i$. Further, since $X_i = 1$ in $RS_i$, it will always
specify the only routing alternative from the root R to the
next less significant node $R + X_1$ of level n-1 (Fig. 2).

The value of $X_{i-1}$ in $RS_i$ which is next to $X_i$ specifies
the route from level n-1 to level n-2.
If $X_{i-1} = 1$, the only node $R + X_1$ of level n-1 is followed
by the right node, $RN = R + X_1 + X_2$, of level n-2. If $X_{i-1} = 0$,
node $R + X_1$ is followed by the left node $LN = R + X_2$ of level n-2.
Similarly, it is easy to show that variable $X_{i-2}$ is
responsible for selecting the route from level n-2 to level
n-3, etc. Therefore, as follows from this analysis, consecutive
variables $X_i, X_{i-1}, X_{i-2}, \dots, X_1$ that are present in $RS_i$,
where $X_i = 1$ and $X_j = 0$ or 1, are responsible for the entire
route selection from one level to the next in the minimal
communication path from the root R to the destination node $N_d$.
[Fig. 8 (a) Formation of the address code AC = 1011000 from
the recursive sum RS = 1101000, AC = rot(RS).
(b) Construction of the path from the root R = 10110
to node $N_d$ = 11101 with AC = 11010: CL = 1 at level n = 5 (root),
CL = 2 at level n-1 = 4, CL = 3 at level n-2 = 3, CL = 4 at level n-3 = 2,
CL = 5 at level n-4 = 1, CL = 6 at level n-5 = 0 (destination).
(c) Routing information code, RI, of the CM message: $CL_d$ = 6,
AC = 11010; $RS_d = R + N_d$ = 10110 + 11101 = 01011.]


Also, the recursive sum $RS_i = R + N_d$ contains all the routing
information necessary to forward the message issued by the
root to the destination node. Namely, the address code $AC_d$
may be obtained from $RS_i = N_d + R$ by moving its bit $X_i = 1$ to
bit 1 of AC; $X_{i-1}$ has to be moved to bit 2, ..., and $X_1$ to bit i
of AC (Fig. 8a). This simple logical operation will be called
rotate, or rot: $AC = rot(RS_i)$ (Fig. 8a).


Therefore, as was shown, the address code $AC_d$ of the destination
node $N_d$, obtained from the recursive sum $RS_i = N_d + R$
via simple rotation, as $AC_d = rot(RS_i)$, contains all the routing
information necessary to pass a message from R to $N_d$.
However, to reduce the routing algorithm to execution of a
simple logical operation in each transit node, we will stipulate
that each node of level n-j will store a so-called complemented
level code, CL (Fig. 8b), of the size $\log_2 n = \log_2 \log_2 K$
that shows which bit of the address code must be selected for
finding the route from the given node to the next node in the
path. To find CL, each node finds its individual recursive
sum as $RS_j = R + N$, then shifts $RS_j$ to the LSB until LSB = 1 and
counts the number of shifts k; CL = n-(k-1), where n is the
number of levels in the tree. Each node finds its CL when a
new tree configuration is established. Thereafter, it is
stored in a special register CL, of size $\log_2 n$, since it will
select the routing bit in all communications that will pass
through a given transit node (Fig. 6).
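The rotation, the complemented level code and the bit-by-bit route selection described so far can be sketched together. The sketch assumes that rot reverses the order of $X_1, \dots, X_i$ (as in Fig. 8a) and that CL = i + 1 for a node of level n-i, which equals n-(k-1) when k counts the shifts needed to bring the level variable to the LSB. Instead of modelling the connecting elements, it tracks the recursive sum of each transit node; a selected bit of 1 picks the right node, 0 the left node. All function names are illustrative.

```python
def reverse_prefix(word: int, length: int, n: int) -> int:
    """Reverse the order of the `length` most significant bits of an n-bit word."""
    out = 0
    for j in range(1, length + 1):               # j counts bits from the left
        bit = (word >> (n - j)) & 1
        out |= bit << (n - (length - j + 1))
    return out


def level_index(rs: int, n: int) -> int:
    """Index i of the level variable X_i (least significant 1 of RS); 0 for the root."""
    if rs == 0:
        return 0
    k = (rs & -rs).bit_length() - 1              # shifts needed until LSB = 1
    return n - k


def address_code(rs: int, n: int) -> int:
    """AC = rot(RS)."""
    return reverse_prefix(rs, level_index(rs, n), n)


def complemented_level(rs: int, n: int) -> int:
    """CL = n - (k - 1) = i + 1."""
    return level_index(rs, n) + 1


def route_from_root(root: int, dest: int, n: int) -> list[int]:
    """Position codes visited by a message travelling from the root to dest."""
    rs_d = root ^ dest
    ac, cl_d = address_code(rs_d, n), complemented_level(rs_d, n)
    rs, cl, path = 0, 1, [root]
    while cl != cl_d:
        bit = (ac >> (n - cl)) & 1               # bit number CL of the address code
        rs = (rs >> 1) | (bit << (n - 1))        # child: shift right, X_1 = chosen bit
        cl += 1
        path.append(root ^ rs)
    return path


path = route_from_root(0b10110, 0b11101, 5)      # values of Fig. 8
print([format(p, "05b") for p in path])
```

For the values of Fig. 8 the sketch gives AC = 11010 and CL = 6 for the destination, and the path ends at 11101 after five hops.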


Routing Algorithm for $R \to N_d$ Communication. Given: root
R, destination node $N_d$; CL value stored in each node of the
tree. The algorithm routes the message issued by R to a destination
node $N_d$. The algorithm contains two steps: root
step and transit step. The root step is aimed at finding the
address code, AC, of the destination node, forming the RI code
for the communication message, CM, and passing CM to the only
node of level n-1. Transit step is executed iteratively in
each transit node of the communication path. It is aimed at finding
the next transit node of the communication path and moving
the CM message to this node.

The entire execution of this algorithm is shown in Fig.
8(a,b,c), for finding the route from the root R = 10110 to
node $N_d$ = 11101 (Fig. 2). Fig. 8a forms the address code $AC_d$
of the node $N_d$. Fig. 8b shows how to find CL code in each
transit node of the path $R \to N_d$. Fig. 8c forms the RI code
and shows what actions are performed in each transit node N
that receives CM message from its predecessor in the communication
path.

[Fig. 9 (a) The RI-code for the node $N_s \to$ node $N_d$ communication
path, with fields $CL_{CIR}$, $CL_d$ and $AC_d$. The tree has K = 1024 nodes;
each CL field takes 4 bits ($\log_2 \log_2 1024$, rounded up), $AC_d$ takes
$10 = \log_2 K = \log_2 1024$ bits; RI = 18 bits; $N_d$ = 1011011011.]

5.2. Node-Node Communication, $N_s \to N_d$


The node-to-node communication is a more general case of the
$N_s \to R$ and $R \to N_d$ communications considered, where $N_s$ is the
source node and $N_d$ is the destination node. Indeed, for any $N_s \to N_d$
communication, the task of an optimal routing technique is
to find the so-called closest intermediate root, CIR, which
is understood as the tree node that connects $N_s$ and $N_d$ with
the minimal communication path $N_s \to CIR$, $CIR \to N_d$ (Fig. 9),
i.e., for $N_s \to R$; $R \to N_d$, CIR = R. Thus, by developing efficient
node-to-node communication techniques that bypass the
absolute root, R, we will reduce the amount of traffic that
goes to and from the root, R, and minimize the total length
and the total delay of the communication path that connects $N_s$
and $N_d$.


A particular case of the $N_s \to N_d$ communication is when

(1) $N_s$ is a transit node on the path that connects the
root, R, with $N_d$; or

(2) $N_d$ is a transit node on the path that connects
$N_s$ with the root, R.

The first case is reduced to the $R \to N_d$ communication,
whereas the second case is reduced to the $N_s \to R$ communication
considered above in Section 5.1. Therefore, our discussion
will concentrate on finding the minimal $N_s \to N_d$ communication
path in which neither $N_s$ nor $N_d$ is on the subpath that
connects the root with the other node ($N_d$ or $N_s$).

5.2.1. Closest Intermediate Root. To find the minimal
$N_s \to N_d$ communication path, we have to find first the closest
intermediate root, CIR (Fig. 9).


Since CIR is on the subpath that connects the root, R,
with both $N_s$ and $N_d$, the address code, $AC_{CIR}$, of the CIR is a
left prefix of the two address codes, $AC_s$ and $AC_d$, because if
root, R, generates communication message to $N_s$ or $N_d$, then the
CIR is the transit node on both paths, $R \to N_s$, $R \to N_d$. Thus,
every message issued by the R root passes CIR before it
reaches $N_s$ and $N_d$.

Therefore, the address code of CIR, $AC_{CIR}$,
is the common most significant portion or the common left prefix
of $AC_s$ and $AC_d$.

[Fig. 9 (b) Formation of the $N_s \to N_d$ communication path:
equality conditions (CL compared with $CL_d$) and selected variables
of AC = 1001 during the TOL move.]

The number of bits in $AC_{CIR}$ can be determined as follows.
First, we find $MS = AC_s + AC_d$, and the number of consecutive
most significant zeros in MS. This will give us the number of
bits in the common left prefix of both $AC_s$ and $AC_d$, or the bit
size of $AC_{CIR}$. Thus, the following technique can be used for
finding the address code of the closest intermediate root,
$AC_{CIR}$, and its complemented value, $CL_{CIR}$.

Algorithm for Finding $AC_{CIR}$ and $CL_{CIR}$

Given: $N_s$, $N_d$ and R

Result: $AC_{CIR}$ and $CL_{CIR}$

Step 1: Find $AC_s$ and $AC_d$.

Step 2: Find $AC_s + AC_d = MS$.

Step 3: Find the number of consecutive and most
significant zeros, L0, in MS; $AC_{CIR}$ is the left prefix of either
$AC_s$ or $AC_d$ that contains L0 binary digits. Complemented
level value of CIR, $CL_{CIR}$ = L0 + 1.


Example: Let us find $AC_{CIR}$ and $CL_{CIR}$ for $N_s$ and $N_d$ given
by the following position codes:

$N_s$ = 1101110100, $N_d$ = 1011011011. Assume that these are
tree nodes in a binary tree having root R = 0010011011 and
bias B = 0110101101 (Fig. 9). First, we find address codes
$AC_s$ and $AC_d$. To do so, we have to find recursive sums $RS(N_s)$
and $RS(N_d)$:

$RS(N_s) = N_s + R$ = 1101110100 + 0010011011 = 1111101111;

$RS(N_d) = N_d + R$ = 1011011011 + 0010011011 = 1001000000.

Address code of $N_s$ is found by rotation of $RS(N_s)$:
$AC_s = rot(RS(N_s))$ = 1111011111. Similarly, $AC_d = rot(RS(N_d))$ =
1001000000. Let us find $MS = AC_s + AC_d$ = 1111011111
+ 1001000000 = 0110011111. In the MS, the number of
consecutive most significant zeros, L0 = 1.

Therefore, $AC_{CIR}$ = 1000000000 and $CL_{CIR}$ = L0 + 1 = 1 + 1 = 2.

Position code of CIR can be found by rotating its address
code to obtain its recursive sum: $RS(CIR) = rot(AC_{CIR})$ =
1000000000. Thereafter, the CIR position code is CIR =
R + RS(CIR) = 0010011011 + 1000000000 = 1010011011.
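The three steps of the algorithm can be sketched as below. The sketch assumes that rot reverses the order of the variables $X_1, \dots, X_i$ of a recursive sum, and it covers only the general case treated in the text, where neither node lies on the other's path to the root. Function names are illustrative.

```python
def reverse_prefix(word: int, length: int, n: int) -> int:
    """Reverse the order of the `length` most significant bits of an n-bit word."""
    out = 0
    for j in range(1, length + 1):
        bit = (word >> (n - j)) & 1
        out |= bit << (n - (length - j + 1))
    return out


def address_code(rs: int, n: int) -> int:
    """AC = rot(RS): reverse X_1..X_i, where X_i is the least significant 1 of RS."""
    if rs == 0:
        return 0
    i = n - ((rs & -rs).bit_length() - 1)
    return reverse_prefix(rs, i, n)


def closest_intermediate_root(src: int, dst: int, root: int, n: int):
    """Return (CIR position code, AC_CIR, CL_CIR)."""
    ac_s = address_code(src ^ root, n)           # Step 1
    ac_d = address_code(dst ^ root, n)
    ms = ac_s ^ ac_d                             # Step 2: MS = AC_s + AC_d (mod 2)
    l0 = n - ms.bit_length()                     # Step 3: leading zeros of MS
    prefix_mask = ((1 << l0) - 1) << (n - l0)
    ac_cir = ac_s & prefix_mask                  # common left prefix, zero padded
    rs_cir = reverse_prefix(ac_cir, l0, n)       # RS(CIR) = rot(AC_CIR)
    return root ^ rs_cir, ac_cir, l0 + 1


cir, ac_cir, cl_cir = closest_intermediate_root(
    0b1101110100, 0b1011011011, 0b0010011011, 10
)
print(format(cir, "010b"), format(ac_cir, "010b"), cl_cir)
# 1010011011 1000000000 2
```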


5.2.2. $N_s \to N_d$ Communication Path. To reach destination
node $N_d$, the source node $N_s$ has to form a communication message
that contains the following routing information code: RI =
$CL_{CIR}$, $CL_d$, $AC_d$ (Fig. 9), where: $CL_{CIR}$ is the complemented level
value for the CIR-node; $CL_d$ is the complemented level value for
the destination node, $N_d$, and $AC_d$ is the address code for $N_d$. All
these values can be found using the techniques presented
above.

Each communication path from $N_s$ to $N_d$ includes two subpaths
$N_s \to CIR$ and $CIR \to N_d$, where $N_s \to CIR$ is in the direction
of TOR (towards the root R) and $CIR \to N_d$ is in the direction
of TOL (towards the leaves) (Fig. 9b). In moving in the TOR
direction from $N_s$ to CIR, a communication message uses the
code $CL_{CIR}$ to reach CIR. Namely, in each transit node N of
the $N_s \to CIR$ subpath, $CL_{CIR}$ stored in the message is compared
with the local $CL_N$. If $CL_{CIR} \ne CL_N$, the communication message is
forwarded to the next node of the path. If $CL_{CIR} = CL_N$, the
destination node, CIR, of the $N_s \to CIR$ path is reached. Having
reached CIR, the message begins the TOL movement (towards
the leaves) using the $AC_d$ and $CL_d$ portions of the RI code (Fig.
9b). Namely, in each transit node, N, of the $CIR \to N_d$ path,
$CL_d$ is compared with the local $CL_N$. If $CL_d \ne CL_N$, the node N
finds the value $X_{CL}$ of the address code, $AC_d$, stored with the
message, where CL is the local level value code. If $X_{CL} = 1$,
the message is forwarded to the next right node. If $X_{CL} = 0$,
the message is forwarded to the next left node. If $CL_d = CL_N$,
$N = N_d$, i.e., destination is reached.


Example: For the $N_s \to N_d$ communication path of Fig. 9a,
the routing information code, RI, of the communication message is
shown in Fig. 9a. As seen, $CL_{CIR}$ = 2, $CL_d$ = 5 and $AC_d$ =
1001000000. For the TOR broadcast $N_s \to CIR$, the message
reaches CIR having CL = 2. Thereafter, it starts the TOL
transfer, $CIR \to N_d$. For the CIR-node, $CL_{CIR}$ = 2. Therefore,
the 2nd bit of $AC_d$ = 1001 shows what node of the tree follows
the CIR-node. Since $X_2 = 0$, CIR is followed by the left
node LN. In the LN node, $CL_N$ = 3 and $X_3 = 0$ in $AC_d$ = 1001.
Thus, LN is followed by the next left node, LN, of the path,
etc. The destination is achieved in the node N having $CL_N$ =
$CL_d$.
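The two-phase movement of this example can be traced with a compact model. It is a sketch under stated assumptions, not the node hardware: each hop is represented by the recursive sum of the visited node, a TOR hop is a one-bit left shift of that sum, and a TOL hop is a one-bit right shift with the selected address bit placed in $X_1$. The `min` guard for the degenerate cases of Section 5.2 is an addition of this sketch, and all function names are illustrative.

```python
def level_index(rs: int, n: int) -> int:
    """Index i of the least significant 1 of RS, counted from the left; 0 for the root."""
    return 0 if rs == 0 else n - ((rs & -rs).bit_length() - 1)


def address_code(rs: int, n: int) -> int:
    """AC = rot(RS): reverse the order of X_1..X_i."""
    i, out = level_index(rs, n), 0
    for j in range(1, i + 1):
        out |= ((rs >> (n - j)) & 1) << (n - (i - j + 1))
    return out


def route(src: int, dst: int, root: int, n: int) -> list[int]:
    """Position codes visited on the N_s -> CIR -> N_d path."""
    mask = (1 << n) - 1
    rs_s, rs_d = src ^ root, dst ^ root
    ac_s, ac_d = address_code(rs_s, n), address_code(rs_d, n)
    cl_s, cl_d = level_index(rs_s, n) + 1, level_index(rs_d, n) + 1
    l0 = n - (ac_s ^ ac_d).bit_length()
    cl_cir = min(l0 + 1, cl_s, cl_d)      # guard (assumption) for ancestor cases
    rs, cl, path = rs_s, cl_s, [src]
    while cl != cl_cir:                   # TOR: compare CL_CIR with the local CL
        rs = (rs << 1) & mask
        cl -= 1
        path.append(root ^ rs)
    while cl != cl_d:                     # TOL: bit number CL of AC_d picks RN or LN
        bit = (ac_d >> (n - cl)) & 1
        rs = (rs >> 1) | (bit << (n - 1))
        cl += 1
        path.append(root ^ rs)
    return path


path = route(0b1101110100, 0b1011011011, 0b0010011011, 10)
print(len(path) - 1, "hops via", format(path[9], "010b"))
```

For the nodes of the example, the message makes nine TOR hops to reach CIR = 1010011011 and three TOL hops (left, left, right) to reach $N_d$.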


As follows for the $N_s \to N_d$ communication, the bit size of
the routing information code, RI, is upper-bounded as follows:

$$BS(RI) \le n + 2\log_2 n = \log_2 K + 2\log_2\log_2 K$$

where n is the number of tree levels and K is the number of
tree nodes.
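As a quick numerical check of this bound, the sketch below evaluates it for a given number of levels. It assumes that each of the two CL fields is rounded up to a whole number of bits, which is how the 1024-node case of Fig. 9a arrives at 18 bits. The function name is illustrative.

```python
def ri_bit_size(n: int) -> int:
    """Upper bound on BS(RI) = n + 2*log2(n), with log2(n) rounded up to whole bits."""
    cl_bits = (n - 1).bit_length()      # integer ceil(log2 n) for n >= 1
    return n + 2 * cl_bits


print(ri_bit_size(10))  # 18
```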


Conclusions 


In this paper, we discussed efficient communication algorithms
for multicomputer networks organized as reconfigurable
binary trees.

Organization of optimal node-to-node communications, $N_s \to N_d$,
was considered, where $N_s$ is an arbitrary source node and $N_d$ is
an arbitrary destination node.


Communication between $N_s$ and $N_d$ is organized via the minimal
communication path, i.e., the destination node $N_d$ is
achieved in such a way that each transit node of the communication
path is included only once in the path $N_s \to N_d$. Communication
techniques introduced are optimal from both viewpoints:
the total bit size of the routing information code,
BS(RI), that routes the message among various transit nodes of
the communication path and the total time of communication,
CD. It is shown that $BS(RI) \le \log_2 K + 2\log_2\log_2 K$, where K is
the total number of nodes in a binary tree. Also, since only
one bit of the RI code is responsible for selecting the next
node in the path of the two possible alternatives LN or RN,
the BS(RI) achieves the absolute theoretical minimum, because it
is impossible to have 0 bits assigned for such a selection.
Further, since each transit node N selects its successor RN or
LN in the communication path via a logical operation that takes
time of one ty, the total time of $N_s \to N_d$ communication also
approaches the theoretical minimum. It should be noted that
the importance of these binary trees for computation cannot be
overestimated, because of the following unique attributes
they possess: (a) reconfiguration into a new binary tree
takes the time of one clock period and requires only one control
code (bias, B) to be received by all tree nodes requested
for reconfiguration; (b) these trees are extremely fault-tolerant,
since during one clock period, one can purge out all
faulty nodes and form a gracefully degraded tree out of fault-free
nodes [21]; (c) as was shown in this paper, communications
in such trees are also extremely efficient from both the
time viewpoint and the bit size of the routing information
code.


Therefore, one can broadly utilize the unique properties of
these reconfigurable binary trees for various computational
and control algorithms without paying the price of considerable
reconfiguration overhead, during which no computation can be
performed, and of a large address portion of the communication
message, both of which are attributes of conventional reconfigurable
binary trees.
ABSTRACT: We develop some fundamental
algorithms for d-dimensional pyramid computers,
where a 1-dimensional pyramid is
typically called a tree and a 2-dimensional
pyramid is what is commonly meant
by a pyramid computer. We give optimal
Θ(n/lg(n)) algorithms for sorting and
merging n items stored in a tree, and
show that for higher dimensional pyramids
well-known algorithms are optimal. We
give a selection algorithm, suitable for
pyramids of all dimensions, which runs in
less than n**ε time for any ε>0.

We also consider median filtering of a
noisy digitized picture, giving an optimal
algorithm whose running time is Θ(D)
when using a D x D window.


1. INTRODUCTION 


Sorting, merging, and selection
(i.e., finding the k'th item) are fundamental
problems, and for almost any
machine architecture efficient algorithms
have been devised for them. However, for
certain tree and pyramid computers we
show that when the data is already in the
base of the computer the "obvious" algorithms
are not always optimal, and we
present algorithms superior to any previously
published. Some authors [2,3,8,11]
have presented optimal sorting algorithms
for tree machines in which the data
passes through the root, but we are
interested in situations where the data
is already present and must be rearranged.
Such situations arise when a
different key is now being used to determine
the ordering, or when data can be
entered directly into the base. This
latter possibility has been raised for
pyramid machines devoted to image processing,
for which the speed-ups introduced
here may be of use. If the data must
pass through the apex then all algorithms
are at best linear in the number of
items, and [2,3,8,11] showed that linear
algorithms exist. However, in our situation
one can obtain sublinear algorithms
for sorting and merging, where the speed-up
depends upon the dimension of the
pyramid (defined below). For selection we
show that the possible speed-up is even
more impressive, giving an algorithm
which finds, on a pyramid of any dimension,
the k'th item among n in o(n**ε)
time for any ε>0.


Our interest in pyramid machines
stems from our belief, and that of many
others, that their geometric basis makes
them a natural parallel architecture to
consider for various database [2,3,11]
and image processing [4,5,6,9,12,13,15,
16,17,19] applications, applications in
which significant parallelism is possible.
Pyramids are more attractive than
meshes because they have the potential
of logarithmic algorithms. Furthermore,
the regularity of a pyramid structure
makes it feasible to construct such machines
with a large number of processing
elements. Several tree and pyramid machines
are in various stages of design
[2,3,8,11,15], and we expect that more
advanced machines, with a very large
number of processing elements, will be
constructed.


Our results are primarily of theo- 
retical use, in that the algorithms are 
somewhat complicated, but they do at 
least show that certain problems have so- 
lutions which are asymptotically better 
than was previously thought. Further- 
more, our algorithms tend to use the 
pyramid structure in ways that are dif- 
ferent from previously published algo- 
rithms, and perhaps such techniques will 
prove useful in other problems. 


Throughout we will use lg to denote
log base 2, and Θ, Ω, O, and o
to denote "order exactly", "order at
least", "order at most", and "order
strictly smaller than", respectively.
Optimal will always mean optimal to within
a multiplicative constant.


2. MACHINE MODELS 


Several different models have been
proposed which might naturally be called
"tree" or "pyramid" machines [2,3,4,5,6,...].
The machines considered
here can be viewed as layers of
mesh-connected machines, with connections
between the layers. To define a pyramid
of arbitrary dimension, we first define a
mesh-connected computer of arbitrary dimension.
Throughout we will use d to
indicate the dimension. A d-dimensional
mesh-connected computer of size n**d (a
d-MCC of size n**d) consists of n**d
copies of a single processing element
(PE), arranged in a d-dimensional cubic
lattice. PEs have indices of the form
(I1,...,Id), where 0 ≤ Ii ≤ n-1 for
i = 1,...,d. PEs (I1,...,Id) and
(J1,...,Jd) are neighbors if and only if
|I1-J1| + ... + |Id-Jd| = 1. Each PE (except
those on sides) has 2*d neighbors and
has a unit-time communication link to
each of them.


A d-dimensional pyramid computer of
size n**d (a d-PC of size n**d) consists
of a d-MCC of size n**d, a
d-MCC of size (n/2)**d, ..., and a
d-MCC of size 1, along with additional
communication links specified below.
(Note that a d-MCC of size n**d has
exactly n**d PEs while a d-PC of size
n**d has between n**d and 2*n**d
PEs.) The d-MCC of size n**d is
called the base of the pyramid, and the
d-MCC of size 1 is the apex. The d-MCC
of size (n/2**k)**d is at level k,
e.g., the base is level 0 and the apex
is level lg(n). A PE at position
(I1,...,Id) on level k is connected to
all 2**d of its sons, which are those
processors on level k-1 which are at a
position of the form (J1,...,Jd), where
each Ji is either 2*Ii or 1+2*Ii.
Thus, except for processors along the
sides, each PE is connected to 2**d +
1 + 2*d others, namely its 2**d sons on
the level below, its parent at the level
above, and 2*d neighbors at the same
level. Figure 1a shows a 1-PC, typically
called a tree machine, and Figure 1b
shows a 2-PC, which is often what is
meant by a pyramid machine. To date we
know of no serious interest in pyramids
of higher dimensions, but since our
algorithms naturally extend to higher
dimensions we have done so. We also note
that there is beginning to be some interest
in 3-dimensional digitizations [7],
and 3-PCs are a natural architecture for
such problems.

[Figure 1a: a 1-PC of size 8, with levels 0 (base) through 3 (apex).]
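The definitions above can be made concrete with a small sketch. It assumes n is a power of 2, so that every level (n/2**k)**d has an integer side length; the function names are illustrative and do not come from the paper.

```python
from itertools import product


def pyramid_levels(n: int) -> int:
    """Level of the apex, lg(n), for a d-PC whose base has side n (a power of 2)."""
    return n.bit_length() - 1


def pyramid_pe_count(n: int, d: int) -> int:
    """Total PEs in a d-PC of size n**d: sum of (n/2**k)**d over all levels k."""
    return sum((n >> k) ** d for k in range(pyramid_levels(n) + 1))


def sons(index: tuple[int, ...]) -> list[tuple[int, ...]]:
    """The 2**d sons, on the level below, of the PE at position (I1,...,Id)."""
    return [
        tuple(2 * i + off for i, off in zip(index, offsets))
        for offsets in product((0, 1), repeat=len(index))
    ]


def parent(index: tuple[int, ...]) -> tuple[int, ...]:
    """Position, on the level above, of the parent of the PE at `index`."""
    return tuple(j // 2 for j in index)


print(pyramid_pe_count(8, 1))   # 15 PEs in the 1-PC of size 8
print(sons((1, 2)))             # [(2, 4), (2, 5), (3, 4), (3, 5)]
```

The PE counts stay between n**d and 2*n**d, as noted in the text.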


We assume that the PEs have a fixed
number of registers, each of which holds
a word of length Θ(lg(n)), and all
operations take unit time. All communication
links are bidirectional, and any
PE can send and receive a word along any
or all links simultaneously, all taking
unit time. We also assume each processor
holds its level and its coordinates within
that level. If this is not the case,
it can easily be performed in Θ(lg(n))
time. (Note: all analyses of time
assume d is fixed and consider time as
a function of n. While some authors
[10] also consider d as a parameter,
this is made difficult by the fact that
a PE in a d-PC is fundamentally
different from one in a (d+1)-PC.)
Earlier pyramid models [4,12,13,19]
assumed that the PEs were copies of a
finite state automaton which was
designed independent of n, in which
case a single PE could not store the
level it was on nor its position within
that level. Using finite state automata
also drastically changes the nature of
sorting, merging, and selection since
there can only be a fixed number of keys,
independent of n.

[Figure 1b: a 2-PC.]


3. SORTING AND MERGING 


References [2,3,8,11] considered the problem of using machines
similar to a 1-PC to perform selection, insertion, deletion,
and retrieval operations. Requests arrive at the apex at unit
time intervals and are performed on-line, although there is a
logarithmic delay between the time a request arrives and the
time its answer appears. This approach sorts n items in Θ(n)
time, and is optimal if all items are required to pass through
the apex. However, we are interested in situations where the
data is already in the pyramid, stored one item per PE in the
base. For example, in a d-PC of size n**d we can sort n**d
items in Θ(n) time by ignoring everything except the base and
using the d-MCC sorting algorithm in [18]. This compares quite
favorably with the Θ(n**d) time required if all items pass
through the apex, and it is natural to ask if we can do even
better by utilizing the rest of the pyramid. We prove that
further speed-up is possible if and only if d = 1.


Suppose we have a 2-PC of size n**2 sitting "naturally" in
3-dimensional space, by which we mean the layers are all
square grids parallel to each other, with each parent above
the middle of its four offspring. If we consider a plane
cutting perpendicular to both the base and one of its edges,
and passing just to one side of the apex, we see that the
plane will cut n wires at level 0, n/2 wires at level 1, ...,
2 wires at level lg(n)-1, and 2 wires connecting the apex to
level lg(n)-1, cutting a total of 2n wires. To sort n**2 items
stored one per PE in the base it may be that all of them must
pass through the plane, giving a worst case time of Ω(n).
Furthermore, on average half of the items must pass through
the plane, giving an expected time of Ω(n) also. Since one can
sort in Θ(n) time using only the base, we see that the rest of
the pyramid is of no significant use. This argument holds
whenever d ≥ 2.

However, if we consider a 1-PC of size n and cut it by a line
slightly off-center, the line will cut lg(n)+1 wires. This
gives an Ω(n/lg(n)) lower bound for the worst case and expected
case sorting times, a bound which is unattainable using the
base alone.
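The wire counts used in the two cutting arguments can be checked by brute force. The sketch below is illustrative only: it assumes each PE is placed at the centre of the base cells it covers, puts the cut at the hypothetical offset x0 = n/2 + 1/4, and the name `cut_wires` is not from the paper.

```python
from itertools import product


def cut_wires(n, d):
    """Count the links of a d-PC of size n**d that cross the hyperplane
    x0 = n/2 + 1/4 (just to one side of the apex); n is a power of two."""
    cut = n / 2 + 0.25
    levels = n.bit_length() - 1            # lg(n)

    def x0(level, i):
        # First coordinate of the centre of PE number i on this level.
        return (i + 0.5) * 2 ** level

    crossed = 0
    for k in range(levels + 1):
        side = n >> k
        for pos in product(range(side), repeat=d):
            # Mesh link to the next PE along axis 0 (links along the
            # other axes stay on one side of the cut).
            if pos[0] + 1 < side and x0(k, pos[0]) < cut < x0(k, pos[0] + 1):
                crossed += 1
            # Links from this PE down to its 2**d sons.
            if k > 0:
                for bits in product((0, 1), repeat=d):
                    a = x0(k, pos[0])
                    b = x0(k - 1, 2 * pos[0] + bits[0])
                    if min(a, b) < cut < max(a, b):
                        crossed += 1
    return crossed
```

Under these assumptions the function returns 2n for a 2-PC and lg(n)+1 for a 1-PC, in agreement with the counts above.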


Theorem 1 The algorithm outlined below for sorting on a 1-PC
of size n has a worst case and expected case time of
Θ(n/lg(n)), and is thus within a constant multiple of optimal.


Our algorithm is a simple merge sort, 
utilizing the fact that the two sons of 
the apex can be viewed as the apexes of 
disjoint subpyramids. 


To sort on a 1-PC of size n, n > 1, sort each half
(separately and in parallel), and merge.


If S(n) denotes the worst case sorting time, we have

\[ S(1) = 0, \qquad S(n) = S(n/2) + M(n), \quad n \ge 2, \]

where M(n) is the worst case time to merge two runs in a 1-PC
of size n. We see that

\[ S(n) = \sum_{k=0}^{\lg(n)-1} M(n/2^{k}). \]

In Theorem 2 below we show that M(n) = Θ(n/lg(n)), which
proves Theorem 1.
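A quick numerical check illustrates why the sum is dominated by its first term. This is only a sketch: it assumes M(n) is exactly n/lg(n) with constant 1, which the paper does not claim, and the function names are illustrative.

```python
from math import log2


def merge_time(n):
    """Model merge cost M(n) = n / lg(n), with M(1) = 0 (assumed constant 1)."""
    return 0.0 if n < 2 else n / log2(n)


def sort_time(n):
    """S(1) = 0 and S(n) = S(n/2) + M(n), for n a power of two."""
    return 0.0 if n < 2 else sort_time(n // 2) + merge_time(n)


# Ratio S(n) / M(n) for n = 2, 4, ..., 2**40.
ratios = [sort_time(2 ** h) / merge_time(2 ** h) for h in range(1, 41)]
```

Under this model the ratio S(n)/M(n) stays below 3 over the whole range, which is consistent with S(n) = Θ(n/lg(n)).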


MERGING 


Suppose we have a run R1 of items in processors 0..k of the
base, and a run R2 of items in processors (k+1)..(n-1). We
merge these into a single run as follows:


1. Find the median of all the items 
(since n is even we will just 
use the n/2'th item). 


2. There are 4 subruns to consider:
those items in R1 less than or
equal to the median (called sub-
run S1), those items in R1
greater than the median (S2),
those items in R2 less than or
equal to the median (S3), and
those items in R2 greater than
the median (S4). S1 and S4
stay in place, S3 moves behind
S1, and S2 moves in front of
S4. Notice S1 and S3
together hold half of the items,
as do S2 and S4.


3. Merge within each half (sepa- 
rately and in parallel). 
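The three steps can be modelled sequentially to see that they do produce a single sorted run. The sketch below is not the pyramid implementation: it assumes distinct keys, replaces the median search of step 1 by a plain sort, ignores all data-movement cost, and the names `merge_runs` and `merge_sort` are illustrative.

```python
def merge_runs(a, lo, k, hi):
    """Merge the sorted runs a[lo:k] and a[k:hi] in place (distinct keys)."""
    n = hi - lo
    if k == lo or k == hi or n < 2:
        return                              # a single run is already sorted
    half = n // 2
    # Step 1: the (n/2)'th smallest item; a plain sort stands in for the
    # binary search used on the pyramid.
    median = sorted(a[lo:hi])[half - 1]
    # Step 2: split both runs around the median and rearrange them.
    r1, r2 = a[lo:k], a[k:hi]
    s1 = [x for x in r1 if x <= median]
    s2 = [x for x in r1 if x > median]
    s3 = [x for x in r2 if x <= median]
    s4 = [x for x in r2 if x > median]
    a[lo:hi] = s1 + s3 + s2 + s4            # S3 behind S1, S2 in front of S4
    # Step 3: merge within each half (done in parallel on the pyramid).
    merge_runs(a, lo, lo + len(s1), lo + half)
    merge_runs(a, lo + half, lo + half + len(s2), hi)


def merge_sort(a, lo=0, hi=None):
    """Sort each half, then merge, as in the sorting algorithm above."""
    if hi is None:
        hi = len(a)
    if hi - lo < 2:
        return
    mid = (lo + hi) // 2
    merge_sort(a, lo, mid)
    merge_sort(a, mid, hi)
    merge_runs(a, lo, mid, hi)
```

With distinct keys, S1 and S3 together contain exactly the n/2 smallest items, so each recursive call again sees two sorted runs inside its own half.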


Theorem 2 The above algorithm for merging on a 1-PC of size n
has a worst case time of Θ(n/lg(n)), and hence is within a
constant multiple of optimal.

Proof: As for sorting, the worst case time must be Ω(n/lg(n)).
To show that this is attained we analyze the time spent in
each step. Step 1 can be done by a simple binary search. First
the median of R1 is sent up to the apex and back down to the
base, at which time each processor whose item is less than or
equal to this sends up a 1. These are summed, after which it
is known if the value is too high or low. Then the median of
R2 is sent up, then an appropriate quartile point of R1, then
a quartile point of R2, etc. There are at most O(lg(n))
probes, each taking Θ(lg(n)) time, for a total of O(lg(n)**2)
time for step 1.


It is step 2 which takes most of the time. It can be reduced
to two applications of the following problem: how fast can a
run be moved to its destination? To do this rapidly we divide
the base into M = ⌊lg(n)/2⌋ blocks. PEs 0..⌊n/M⌋-1 are in
block 1, ⌊n/M⌋..2*⌊n/M⌋-1 are in block 2, etc. We define a
sequence of processors G1,...,GM by taking G1 to be the apex
and G(i+1) as Gi's right son. For each i we construct a path
Pi connecting Gi to the leftmost processor in block i, with
the property that all paths are disjoint. The paths are
constructed recursively, with P1 being the leftmost edge of
the pyramid. For i>1, Pi is constructed from the base upwards,
at each processor going upwards if it does not run into
P(i-1), and otherwise going right. (See Figure 2.) It can be
shown that the longest path has length O(n/lg(n)), and in
O(n/lg(n)) time each processor can decide which, if any, path
it is on and which of the processors it is connected to are on
the same path.


We notice that when a run needs to move, for any block i
containing some items in the run there is a j such that either
all of those items are to be moved to block j, or else some go
to block j and the others go to block j+1. We call block j the
goal of block i, and move all the run's items from block i to
block j. If some of these belong to block j+1 then we move
them along the base from block j to block j+1. We construct a
path Qi from the leftmost PE in block i to the leftmost PE in
block j as follows: let k=min(i,j). Qi follows Pi up to level
k, then moves sideways to meet Pj, and then goes down Pj.
Since any block is the goal of at most two blocks, it is easy
to see that any processor is in at most 3 of the Q paths. An
item starting in block i moves left until it reaches the start
of Qi, then follows Qi, and then moves right in block j and
also in block j+1 if that is its destination. The total
distance is O(n/lg(n)), and since no processor is in more than
3 paths the data movement can be arranged so that the worst
case time for step 2 is Θ(n/lg(n)).

Therefore M(n), the worst case time to merge two runs in a
1-PC of size n, satisfies

\[ M(1) = 0, \qquad M(n) \le A\,\lg(n)^{2} + B\,n/\lg(n) + M(n/2), \quad n \ge 4, \]

which gives M(n) = Θ(n/lg(n)). This completes the proof of
Theorem 2.


4. SELECTION


In this section we consider the problem of finding the k'th
smallest item. To simplify notation we will discuss only
1-dimensional pyramids (i.e., trees), but all of our results
hold for pyramids of higher dimensions. We will also simplify
discussion by assuming that all the items have distinct keys.
The easiest selection problem is k=1, which can be solved by
having each base processor pass up its item, and each higher
processor pass up the smallest item received. This will take
Θ(lg(n)) time, and Tanimoto [16] observed that if the apex
repeatedly discards the smallest item and asks the son which
passed up that item to pass up another then one can find the
k'th smallest item in O(lg(n)+k) time, which in the worst case
requires Θ(n) time to find the median. We can do better by
sorting the base and passing up the middle item, taking
Θ(n/lg(n)) time. For d>1 the differences are more dramatic,
going from Θ(n**d) down to Θ(n). (Tanimoto had noted this
improvement for d>1.) By abandoning sorting we are able to do
far better, finding the k'th item in o(n**ε) time for any
ε > 0.


To solve the selection problem we need to solve the weighted
selection problem, in which we are given k and N pairs
(vi,wi), where vi is the value of the pair and wi is its
positive integral weight. The vi are all distinct, and we want
to find the vI such that

\[ \sum\{w_i : v_i \le v_I\} \ge k \quad\text{and}\quad \sum\{w_i : v_i < v_I\} < k. \]

Each of the original items in the base has a weight of 1, and
intermediate calculations produce items with greater weights.
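As a concrete reading of this definition, a sequential reference version might look as follows. It is a sketch only: the name `weighted_select` is an assumption, and it simply sorts the pairs rather than modelling any pyramid communication.

```python
def weighted_select(pairs, k):
    """Return the value vI such that the weights of all values <= vI sum
    to at least k while the weights of all values < vI sum to less than k.
    `pairs` holds (value, weight) tuples with distinct values."""
    below = 0                                # total weight of smaller values
    for value, weight in sorted(pairs):
        if below < k <= below + weight:
            return value
        below += weight
    raise ValueError("k must lie between 1 and the total weight")
```

With every weight equal to 1 this reduces to ordinary selection of the k'th smallest value.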


Initially each item is "active", and may later become inactive
when it is known that it cannot be the answer. The algorithm
is:

if n=1 then the item is the answer
else if n=2 then both items are sent to the apex, which
        determines the answer
else repeat

    Stage 1: Each processor at level lg(n)/2 takes as its
    value the median of the active items beneath it, and as
    its weight the number of active items beneath it.

    Stage 2: The apex finds the weighted median of the items
    found in the previous stage, call this W, and transmits it
    to all processors in the base.

    Verification: Each base processor sends up a 1 if its item
    is less than or equal to W. These 1's are summed on their
    way up to the apex, which determines if W is the k'th item
    or is too large or too small. In the last two cases it
    sends down a message which deactivates all items as large
    as W, if W is too large, or all items as small as W if it
    is too small.

until the k'th item is found.
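The repeat loop can be simulated sequentially to check that it always ends with the k'th item. The sketch below makes several assumptions that are not in the paper: each mid-level processor is modelled as a contiguous group of about sqrt(n) base items, the lower median is used when a group has an even number of active items, the weighted median is taken at half the total weight, and the recursive calls of Stages 1 and 2 are replaced by direct computation.

```python
from math import isqrt


def weighted_select(pairs, k):
    """Value whose cumulative weight first reaches k (see the definition above)."""
    below = 0
    for value, weight in sorted(pairs):
        if below < k <= below + weight:
            return value
        below += weight
    raise ValueError("k out of range")


def pyramid_select(items, k):
    """Return (k'th smallest item, number of iterations); keys are distinct."""
    n = len(items)
    group = max(1, isqrt(n))                 # items beneath one mid-level PE
    active = [True] * n
    iterations = 0
    while True:
        iterations += 1
        # Stage 1: (median, count) of the active items in each group.
        reports = []
        for start in range(0, n, group):
            vals = sorted(items[i] for i in range(start, min(n, start + group))
                          if active[i])
            if vals:
                reports.append((vals[(len(vals) - 1) // 2], len(vals)))
        # Stage 2: weighted median W of the reported values.
        total = sum(w for _, w in reports)
        w_med = weighted_select(reports, (total + 1) // 2)
        # Verification: count the items that are <= W.
        rank = sum(1 for x in items if x <= w_med)
        if rank == k:
            return w_med, iterations
        too_large = rank > k
        for i, x in enumerate(items):
            if active[i] and (x >= w_med if too_large else x <= w_med):
                active[i] = False
```

The k'th item is never deactivated, and W itself is deactivated in every unsuccessful iteration, so the loop terminates.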

(This algorithm is similar to one in [14], which was for a
mesh-connected computer with broadcasting.) An important
feature of this algorithm is the fact that on each iteration
at least 1/4 of all active items become inactive, and hence at
most log_{4/3}(n) iterations are required. To see why the 1/4
appears, suppose on some iteration W was too high. The weights
guarantee that at least 1/2 of all active items are beneath a
processor of height lg(n)/2 which, during stage 1, picked an
item at least as large as W. For each such processor on the
middle level, at least half of the active items beneath it are
as large as W.


To see how much time this algorithm takes it is easiest to
work with the height of the tree (i.e., lg(n)) instead of its
width. If T(h) denotes the worst case time on a 1-PC of height
h, then

\[ T(1) = \Theta(1), \qquad T(h) \le h \log_{4/3}(2)\,\bigl[\,2\,T(h/2) + B\,h\,\bigr], \quad h > 1. \]

The solution of this is Θ([C*h]**[lg(h)/2]), where
C = 2*[log_{4/3}(2)]**2, which is o(n**ε) for any ε > 0.


Theorem 3 In o(n**ε) time, for any ε > 0, the above algorithm
finds the k'th item among n items stored in the base of a 1-PC
of size n.


We make no claims about the optimality of our algorithm. In
fact, if one uses lg(lg(n))/2 stages, instead of the 2 used
above, each determining the weighted median of items
determined below in the previous stage, then one can do
selection in O(lg(n)**[lg(lg(n))/lg(lg(lg(n)))]) time. Further
fine-tuning of this approach does not seem very interesting.
We conjecture that Θ(lg(n)) is unattainable as a worst case
time for selection, but have been unable to prove this. There
is also the question of expected case time, and we do not even
know the expected time of our algorithm.


5. MEDIAN FILTERING 


In this section we consider the problem of median filtering of
a noisy digitized picture stored one pixel per PE in the base
of a 2-PC. In median filtering the idea is to replace each
pixel with the median of the pixel values in a D × D square,
D odd, whose center is the original pixel. We call this square
a window. Median filtering is a well-known technique (see, for
example, [29] and the references therein) and is applicable to
data of any dimension. Our reason for concentrating on the
2-dimensional problem is a paper of Tanimoto [16] which gives
a simple algorithm. Our goal is to minimize the running time
as a function of D, where we show below how to eliminate any
dependence on the size of the 2-PC. We give an optimal
algorithm, but we must mention that our asymptotic result is
misleading since in practice D is quite small.
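For reference, the operation being computed can be stated as a direct sequential routine. This sketch only fixes the input-output behaviour and says nothing about the parallel algorithm; clipping windows at the image border and taking the lower median of a clipped window are assumptions, since the text does not discuss boundary pixels.

```python
from statistics import median_low


def median_filter(image, d):
    """Replace each pixel by the median of the d x d window centred on it.
    `image` is a list of equal-length rows; d must be odd. Windows are
    clipped at the border (an assumption made for this sketch)."""
    assert d % 2 == 1
    r = d // 2
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            window = [image[a][b]
                      for a in range(max(0, i - r), min(rows, i + r + 1))
                      for b in range(max(0, j - r), min(cols, j + r + 1))]
            out[i][j] = median_low(window)
    return out
```

This direct method does on the order of D**2 work per pixel, which is the kind of cost the parallel algorithms in this section aim to avoid.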


Tanimoto [16] noted that any dependence on the size of the
pyramid could be eliminated by partitioning the image into a
set of nonoverlapping regions of area Θ(D**2). Processors at
height ⌈lg(D)⌉ are viewed as the apex of a subpyramid in which
filtering is performed, with each subpyramid responsible for
computing the new value for each pixel within it. Windows
around pixels in one subpyramid can include pixels from an
adjacent subpyramid, so first all necessary information is
exchanged between adjacent subpyramids. This exchange can
easily be done in Θ(D) time.


Within each subpyramid Tanimoto computed the new pixel values
one at a time. The total time for this method is
Θ(D**2 * T(D)), where T(D) is the time needed to compute the
median in a 2-PC of size D**2. The procedure he first mentions
finds the median in Θ(D**2) time, giving a total time of
Θ(D**4). He also notes that by just sorting in the base the
median can be found in Θ(D) time, giving Θ(D**3) total time.
By instead using the selection algorithm of the preceding
section one can perform median filtering in o(D**[2+ε]) time
for all ε > 0.


However, Tanimoto's approach is not very efficient as it
ignores the fact that calculating the median at one point is
closely related to calculating the median at any adjacent
point. By exploiting this we are able to give an algorithm
taking Θ(D) time. A straightforward data movement analysis, as
in section 3, shows that this is an optimal worst-case time.
It seems that this is also an optimal expected-case time.


Theorem 4 In a 2-PC or a 2-MCC, using a D × D window, median
filtering can be accomplished in Θ(D) time.


The theorem is proved by the following algorithm, which is
only briefly sketched. We use only the base of a 2-PC,
partitioned into D × D blocks. Within each block each
processor will determine its new pixel value. Because windows
of pixels in the block can fall outside the block, the block
needs to know all pixel values in a (2D-1) × (2D-1) square
called a neighborhood. Also, for reasons that will be
explained later, the block actually must simulate the actions
of processors lying in a (3D-2) × (3D-2) square called a
region. (See Figure 3.) Each processor in the block simulates
the actions of a square of 9 processors from the region, with
the algorithm being described as if all of the processors in
the region are assisting the block.

[Figure 3: a block inside its (3D-2) × (3D-2) region.]


Via sorting, each neighborhood processor determines the order
position of its pixel value, e.g., a processor may determine
that its pixel value is the fifth smallest in the
neighborhood. Each neighborhood processor x sends this order
position to the four processors c1,c2,c3,c4 indicated in
Figure 4. These four processors are the corners of the square
of all processors whose windows include x. Notice that the
region has been chosen so that if x is in the neighborhood
then c1,c2,c3, and c4 all lie within the region.
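The claim that the corners always lie within the region follows from the three square sizes, and can be checked mechanically. In this sketch the block is placed at coordinates 0..D-1 in each direction and the neighborhood and region are centred on it; that placement, the order in which the corners are listed, and the helper names are assumptions made for illustration.

```python
def window_corners(x, d):
    """Corners of the square of processors whose d x d windows contain x."""
    r = (d - 1) // 2
    i, j = x
    return [(i - r, j - r), (i - r, j + r), (i + r, j - r), (i + r, j + r)]


def in_square(p, origin, side):
    """True if point p lies in the side x side square starting at origin."""
    return all(0 <= c - o < side for c, o in zip(p, origin))


def corners_stay_in_region(d):
    """Check every neighborhood pixel of the block at (0,0)..(d-1,d-1)."""
    r = (d - 1) // 2
    nbhd_origin, nbhd_side = (-r, -r), 2 * d - 1              # (2D-1) x (2D-1)
    region_origin, region_side = (-2 * r, -2 * r), 3 * d - 2  # (3D-2) x (3D-2)
    return all(in_square(c, region_origin, region_side)
               for i in range(nbhd_side)
               for j in range(nbhd_side)
               for c in window_corners((nbhd_origin[0] + i,
                                        nbhd_origin[1] + j), d))
```

The neighborhood extends (D-1)/2 beyond the block and the corners extend a further (D-1)/2 beyond the neighborhood, which gives the region side D + 2(D-1) = 3D-2.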


Using these corners, given any order interval there is a
"spreading wave" process taking Θ(D) time, after which each
processor in the block will know how many pixel values in its
window fall in the order interval. There are 4D**2 - 4D + 1
pixels in the neighborhood, so by using 4D order intervals
each has at most D values. We repeat the spreading wave
process 4D times, pipelining the waves so that it takes only
Θ(D) time. When finished, each processor in the block knows
which order interval contains the median of its window. The
processor also knows a relative description of which pixel
value in the order interval is the median of its window, e.g.,
the processor may know that in the appropriate order interval
it is looking for the third smallest pixel value belonging to
its window. A final combination of sorting and searching, also
taking Θ(D) time, finishes the algorithm.

[Figure 4: a neighborhood processor x and the corner
processors c1, c2, c3, c4.]


The algorithm sketched above can be modified to provide an
optimal Θ(D/lg(D)) median filtering algorithm using a window
of size D, for 1-dimensional data, stored in the base of a
1-PC. It can also be adapted to d-PCs and d-MCCs, d ≥ 2, using
windows of size D**d, but the adaptations give Θ(D**(d/2))
algorithms. The only lower bound known to the author is Ω(D),
so it seems that optimal algorithms for higher dimensions will
require different techniques.


6. CONCLUSION 


Our results can be rearranged to allow a better comparison
between pyramids of different dimensions. If we have a d-PC of
size n, with one item per base processor, then these n items
can be sorted in Θ(n/lg(n)) time if d = 1, and Θ(n**(1/d))
time if d > 1. (This holds for both worst case and expected
case times.) The 1-PC algorithm is new, while the others are
well-known, and all are optimal to within a constant multiple.
Our 1-PC sorting algorithm depended on a new optimal merging
algorithm. While there was no space to explore this further
here, the 1-PC sorting algorithm gives several other optimal
Θ(n/lg(n)) algorithms.


We also considered selection, providing an algorithm much
faster than previous ones. Previous algorithms utilized
sorting, which does not fit the pyramid structure very well.
Pyramids provide logarithmic paths between any two processors,
but if there is too much data movement, such as occurs with
sorting, then the apex becomes a bottleneck. By reducing the
amount of data being moved we reduced the time to o(n**ε) for
any ε > 0. Thus selection is faster than sorting on serial
computers [1], mesh-connected computers with broadcasting
[14], and pyramids.


Our selection algorithm also gives a faster median filtering
algorithm which, when using a D × D filtering window, takes
o(D**[2+ε]) time for any ε > 0. However, by making better use
of the fact that windows of adjacent processors overlap
extensively we are able to give an optimal Θ(D) algorithm. In
a future paper the author will consider a variety of windowing
operations on data stored in mesh-connected computers and in
pyramid computers.
ABSTRACT -- Omni-sort is a versatile operation proposed to
perform basic database operations such as union, intersection,
difference, duplicate-removal, and sorting. We present
adaptation schemes to implement the omni-sort by modifying
parallel sorting algorithms. The modification cost is small,
and the initial expense of design and test development is
offset by the increased usage. From the algorithmic point of view,
the omni-sort also provides potentially much faster algorithms
for those data processing operations than before.


1. INTRODUCTION 


Since sorting is one of the most important operations in data
processing, high-speed parallel sorting methods are very
desirable. In the last fifteen years, many parallel algorithms
have been proposed and developed for sorting. Because of the
advances in VLSI technologies, highly parallel computing
systems consisting of hundreds of thousands of processing cells
are feasible. These algorithms
may be realized as algorithmically specialized processors [7] or
as parallel programs in general-purpose computing systems. In
both cases, the initial cost of design and test development is so
high that it should be offset by mass production or heavy
usage.


In this paper we propose a powerful operation called
omni-sort, which can selectively perform multiple data processing
functions. The functions performed by the omni-sort are five
basic data processing operations: union, intersection, difference,
duplicate-removal, and sorting [10]. Omni-sort can be
implemented by slightly modifying already developed
parallel sorting algorithms. We do not distinguish
algorithms to be implemented as specialized processors from
those to be programmed in general-purpose computing systems,
since the results shown are applicable to both.


Here we also present schemes to extend sorting algorithms 
to implement the versatile omni-sort. The key point is to 
compose elegantly the sorting capability and some minimal, 
extra computation capabilities. Using the schemes, sorting 
algorithms can be modified to perform the omni-sort such that 
the modification overhead is small and the performance 
overhead is negligible. Based on some sorting algorithms, the 
omni-sort offers faster or even optimal algorithms for 
duplicate-removal, union, intersection, and difference. 


2. THE OMNI-SORT OPERATION 


In this section we first give definitions of the five data
processing operations and the omni-sort. We then show that
any sorting algorithm can be extended to perform the
omni-sort. A step-by-step synthesis is given to demonstrate which
extra capabilities are attached to a bare sorting algorithm, and how.
The added capabilities are intended to be easy to
implement and to introduce minimal overhead.
2.1 DEFINITION


Sorting, duplicate-removal, union, intersection, and
difference are basic operations in relational database processing
[10]. One can find their conventional definitions in many
database textbooks. However, all of these operations except
sorting are defined here differently from their sequential
counterparts. Omni-sort is a new operation defined to
perform all of the five operations selectively.


Let X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_m} be multisets.
By multiset it is meant that there may be duplicate items. This
is in contrast to the mathematical definition of a set, in which all
the elements are distinguishable. From k equivalent data items
in a multiset, k-1 items are arbitrarily chosen to be duplicates.
Throughout this paper we will assume that a sorted sequence is
always in ascending order. The formal definitions of these
operations are given below.


sort(X)             to re-order the sequence x_1, x_2, ..., x_n
                    such that x_p(1) ≤ x_p(2) ≤ ... ≤ x_p(n),
                    where p(1)p(2)...p(n) is called a permutation.

dup_rm(X)           to determine the bit stream
                    q(1)q(2)...q(i)...q(n) such that if
                    q(i) = 1 then x_i is a duplicate.

union(X,Y)          to compute the union set of X and Y,
                    i.e. dup_rm(X) ∪ dup_rm(Y).

inter(X,Y)          to compute the intersection set of X and
                    Y, i.e. dup_rm(X) ∩ dup_rm(Y).

differ(X,Y)         to compute the difference set of X and
                    Y, i.e. dup_rm(X) - dup_rm(Y).

omni-sort(f,X,Y)    to perform one of the above five
                    operations, which is selected and
                    controlled by the argument f.


Figure 1. Controlling the functions performed by the omni-sort. 
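The definitions above can be restated as a small sequential reference model. This is only a sketch of the intended input-output behaviour: the string values used for the argument f, the choice of marking every copy after the first as the duplicate, and the function names are all assumptions made for illustration.

```python
def dup_rm(xs):
    """Bit stream q with q[i] = 1 when xs[i] is treated as a duplicate.
    Which of k equal items survives is arbitrary; here it is the first."""
    seen, bits = set(), []
    for x in xs:
        bits.append(1 if x in seen else 0)
        seen.add(x)
    return bits


def distinct(xs):
    """The items of xs that dup_rm leaves unmarked."""
    return {x for x, q in zip(xs, dup_rm(xs)) if q == 0}


def omni_sort(f, xs, ys=()):
    """Perform the operation selected by f on the multisets xs and ys."""
    if f == "sort":
        return sorted(xs)
    if f == "dup_rm":
        return dup_rm(xs)
    if f == "union":
        return sorted(distinct(xs) | distinct(ys))
    if f == "inter":
        return sorted(distinct(xs) & distinct(ys))
    if f == "differ":
        return sorted(distinct(xs) - distinct(ys))
    raise ValueError("unknown operation: " + str(f))
```

For example, with X = [3, 1, 3] and Y = [1, 4] the union is [1, 3, 4], the intersection is [1], and the difference is [3].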


The duplicate-removal operation defined above simply 
identifies the duplicate items, or marks certain items as 
duplicates. This definition is reasonable for parallel processing 
[4]. If actually eliminating the duplicates is required, an 
additional operation of packing and segregation [6] can be used 
to separate marked and unmarked data items. Alternatively, 
marked data items can be filtered off while outputting the 
sequence. 


The operations union, intersection, and difference defined
above are more general than standard set operations. They are
relaxed from the restriction that operands must be sets of
distinguishable elements. This generalization has its practical
merit in relational database processing. Multisets are artifacts of
relational operations like projection and concatenation. Evidently
many query languages (SEQUEL, QUEL, and QBE) provide
operations for working with multisets [10]. It is not unusual
that duplicate-removal and a set operation need to be executed
in succession. With the more powerful multiset operations, this
can be done by invoking a single operation. Integrating
the two operations gives an opportunity to eliminate the
communication needed between operations.


2.2 A GENERAL ADAPTATION SCHEME 


Assume that a multiset X is sorted in ascending order by
some highly parallel sorting method such that
x_1 ≤ x_2 ≤ ... ≤ x_n.


Definition 1. Marking-minus is a parallel process that marks
certain elements in the sorted multiset X as "minus". It
compares all the pairs of x_i and x_(i+1) for 1 ≤ i < n; if they
are equal, it marks x_i as x_i^-.


A process called marking-plus can be defined similarly on a
sorted multiset Y. Either marking-minus or marking-plus can
detect all the duplicate elements in a sorted multiset. For the
union operation, the distinction between its two operands is not
important and it can be implemented as duplicate-removal.


To unify all five operations, however, there is a problem
of incompatibility in the number of operands among them;
distinction of operands is necessary to perform intersection and
difference. To resolve the incompatibility, we propose that tags
be attached to data items so that X-elements can always be
distinguished from Y-elements. To perform sorting on two
multiset operands, the operands are simply put together and
sorted as if the tags were transparent. The tags are also
useful for extending the sorting algorithm to have the feature defined
below. This feature is crucial to the easy adaptation of sorting
algorithms to perform the omni-sort.


Definition 2. Given two multisets X and Y and a sorting
algorithm to sort them together, the sorting algorithm is called
quasi-stable if x precedes (follows) y in the final sorted
sequence for all x = y, x ∈ X and y ∈ Y.


We summarize the general procedure to extend any sorting
algorithm to implement the omni-sort as follows. First, two
different tags are attached to X- and Y-elements and a
quasi-stable sorting algorithm is executed on X+Y. The
marking-plus process then detects all the duplicates in Y and the
marking-minus detects duplicates in X. Notice that the result
sequence now has the following pattern:

    ..., x^-, ..., x^-, x, y, y^+, ..., y^+, ...

where x = y and x ∈ X, y ∈ Y.


All the equivalent data items, from both X and Y, must be
marked off except possibly pairs of x and y as shown above.
The unmarked data x and y must be "neighbors" and only x is
considered a candidate of valid data in the final result. Then
some more neighbor-communication and local processing is
enough to implement the operations intersection and
difference.


2.3 IMPLEMENTATION 


According to Section 2.2, a general omni-sort algorithm may 
consist of four processing stages: 


(1) Initialization - Each data item is attached a tag to identify
    which operand it belongs to.

(2) Sorting - A quasi-stable sorting algorithm is executed.

(3) Marking - The two marking processes are performed on
    separate operands.

(4) Completion - Data tags may be further modified to reflect
    the final result of the intended operation.


To implement the omni-sort, we use a two-bit flag for each 
data item. The flag bits are appended to each data item as the 
least significant bits. Initial values of X-tag and Y-tag are 
carefully designated to be (0,1) and (1,0) respectively. With the 
meticulous initialization of the flag bits, any sorting algorithm is 
quasi-stable! For sorting or duplicate-removal, we arbitrarily 
choose X-tag for its single multiset operand. Similarly, the two 
multisets of the union operation can be put together and 
treated as X operand. 


Let a_n ... a_0 a_-1 a_-2 be the bit representation of data item a,
where a_-1 a_-2 are the flag bits. The bit representation
b_n ... b_0 b_-1 b_-2 is defined similarly for the data item b. Suppose
that a and b are neighboring data items, with b toward the
larger end of the sorted sequence. To implement
marking-minus or marking-plus, all the pairs of a and b are compared
and the flag bits are modified as follows:


Marking-minus:  if a = b and a_-1 a_-2 = (0,1)
                then a_-1 a_-2 = (0,0);


Marking-plus:   if a = b and b_-1 b_-2 = (1,0)
                then b_-1 b_-2 = (1,1);


Now all the data with the flag bits (0,1) constitute the final
result of the operation sorting, duplicate-removal, or union.
However, data items with the flag bits (0,1) are just candidates
for the result of the operation intersection or difference. To
complete the operation intersection or difference, the program
segments shown below further screen those data with the flag
bits (0,1) to produce the correct result.


/* intersection */
if a_n..a_0 ≠ b_n..b_0 and a_-1 a_-2 = (0,1) and b_-1 b_-2 = (1,0)
then a_-1 a_-2 = (0,0);


/* difference */
if a_n..a_0 = b_n..b_0 and a_-1 a_-2 = (0,1) and b_-1 b_-2 = (1,0)
then a_-1 a_-2 = (0,0);
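The four stages can be traced on small inputs with a sequential model. The sketch below rests on several assumptions that go beyond the text: the test a = b in the marking rules is read as equality of the whole word including the flag bits, all pairs are compared against a snapshot to mimic parallel comparison, and the intersection test is broadened so that an X candidate is dropped whenever its right neighbour is not an unmarked Y item of equal value. The function name and the operation strings are illustrative.

```python
X_TAG, Y_TAG = (0, 1), (1, 0)        # initial flag bits for X and Y items
MINUS, PLUS = (0, 0), (1, 1)         # flag bits after marking


def omni_sort_flags(op, xs, ys=()):
    """Return the items still flagged (0,1) after the four stages.
    op is one of "dup_rm", "union", "inter", "differ" (assumed names)."""
    # (1) Initialization: for union both operands are treated as X.
    y_tag = X_TAG if op == "union" else Y_TAG
    items = [[x, X_TAG] for x in xs] + [[y, y_tag] for y in ys]

    # (2) Sorting: the flag bits act as least significant bits, so for
    # equal values an X item precedes a Y item (quasi-stability).
    items.sort(key=lambda it: (it[0], it[1]))

    # (3) Marking: compare each neighboring pair (a, b) on a snapshot.
    before = [(v, t) for v, t in items]
    for i in range(len(before) - 1):
        a, b = before[i], before[i + 1]
        if a == b and a[1] == X_TAG:
            items[i][1] = MINUS          # marking-minus
        if a == b and b[1] == Y_TAG:
            items[i + 1][1] = PLUS       # marking-plus

    # (4) Completion: screen the remaining (0,1) candidates.
    after = [(v, t) for v, t in items]
    for i, (v, t) in enumerate(after):
        if t != X_TAG:
            continue
        nxt = after[i + 1] if i + 1 < len(after) else None
        matched = nxt is not None and nxt[0] == v and nxt[1] == Y_TAG
        if op == "inter" and not matched:
            items[i][1] = MINUS
        if op == "differ" and matched:
            items[i][1] = MINUS
    return [v for v, t in items if t == X_TAG]
```

For X = [3, 1, 3, 5] and Y = [1, 4, 1, 5], this model yields [1, 5] for intersection and [3] for difference.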


2.4 OVERHEAD 


The modification overhead of the general adaptation scheme 
to extend a sorting algorithm to perform the omni-sort is the 
following: 


• two flag bits for each data item,

• some bit manipulation capability for the flag bits,

• maybe some additional capability for performing the local
  pair-wise comparison, and

• maybe the linear interconnection between processing cells
  for the neighbor-communication.

The last two items are not necessarily overhead. A comparison
function is usually needed to implement sorting anyway, and
linear interconnection of processing cells is available in many
sorting systems. The first two items affect only local circuitry;
they never create significant overhead since it is the
interconnection circuit for communication that occupies most
of the chip area.


The time overhead is even more favorable. The initialization
stage takes only constant time. The marking stage and the
completion stage also take only constant time if the
interconnection is available for the neighbor-communication.
This time overhead is negligible compared with the
minimum requirement of O(log2(n+m)) time for sorting X+Y
[11].


3. EXAMPLES 


In this section we discuss more specific adaptation schemes
to turn sorting algorithms into an omni-sorter. Two families of
sorting methods (enumeration sort and merge sort) and one
type of linear-time sorting system are considered. From the
two sorting methods, we can derive faster-than-ever omni-sort
algorithms for duplicate-removal and the three multiset
operations. The linear-time sorting systems, which emphasize
overlapping sequential I/O time and sorting time, provide a
most feasible and practical solution for high-speed hardware
sorters.


The general adaptation scheme shown in the previous
section implies four processing stages, with the marking stage
distinctly separated from the sorting stage. We call this the
open-form scheme. For the two families of sorting methods,
closed-form schemes requiring less overhead than the
open-form scheme are presented. The closed-form schemes
completely integrate the marking stage into the sorting stage,
primarily by using a new comparison function in the sorting stage.


3.1 ENUMERATION SORT 


In the enumeration sorting methods, each data item is compared
with all the others [3]. In other words, they perform at least
n(n-1)/2 comparisons to sort a sequence of n data items.
Usually enumeration sorting methods consist of two
computation phases.


• Rank computation - From the results of the n(n-1)/2
comparisons the ranks of the data items are determined.


• Data rearrangement - Each data item is moved to its final
position according to its rank.


Of course the open-form scheme can be applied to adapt any
enumeration sorting algorithm into an omni-sort algorithm.
It might, however, be easier for enumeration sorting systems to
incorporate the marking capabilities into the sorting stage. The
following program shows how easily the marking capabilities
can be embedded in the rank computation phase.


1.  for i = 1 to n do rank_i := 1;        /* the smallest */
2.  for i = 1 to n do
      for j = i+1 to n do
3.      if x_i > x_j then rank_i := rank_i + 1
                     else rank_j := rank_j + 1;
4.      if x_i = x_j then                 /* a = x_i, b = x_j */
          if a_1a_2 = (0,1)
            then a_1a_2 := (0,0);         /* marking-minus */
          if b_1b_2 = (1,0)
            then b_1b_2 := (1,1);         /* marking-plus */


Assume that all the data are tagged with the flag bits (0,1) or 
(1,0) appropriately in the initialization stage. Step 4 of the 
above program essentially performs the two marking functions. 
In step 4 the modification of flag bits is carefully programmed 
to be consistent with the rank computation in step 3. 
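
The rank computation with embedded marking can be sketched in Python as follows. This is only an illustrative reading of the program, under the assumption that marking-minus is applied to the earlier item of an equal pair and marking-plus to the later one; the function name `rank_and_mark` and the list-of-pairs representation are hypothetical.

```python
def rank_and_mark(items):
    """items: list of (key, (f1, f2)) pairs whose flags are already initialised.

    Returns (ranks, flags). Ranks start at 1; on equal keys the earlier
    item keeps the smaller rank, mirroring step 3 of the program.
    """
    n = len(items)
    keys = [k for k, _ in items]
    flags = [f for _, f in items]
    ranks = [1] * n                         # step 1: everyone starts as smallest
    for i in range(n):                      # step 2: all n(n-1)/2 pairs
        for j in range(i + 1, n):
            if keys[i] > keys[j]:           # step 3: rank computation
                ranks[i] += 1
            else:
                ranks[j] += 1
            if keys[i] == keys[j]:          # step 4: marking (assumed roles)
                if flags[i] == (0, 1):      # marking-minus on the earlier item
                    flags[i] = (0, 0)
                if flags[j] == (1, 0):      # marking-plus on the later item
                    flags[j] = (1, 1)
    return ranks, flags


ranks, flags = rank_and_mark([(3, (0, 1)), (1, (0, 1)), (3, (1, 0)), (2, (1, 0))])
print(ranks)   # [3, 1, 4, 2]
print(flags)   # [(0, 0), (0, 1), (1, 1), (1, 0)]
```

The two items with key 3 receive consecutive ranks in their input order, and only they have their flag bits changed.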


In [5] a sorter based on the enumeration method was
shown. The sorter is optimal in the sense of reaching the lower
bound time complexity O(log2(n+m)) [11]. Using this kind of
sorter as the base for omni-sort, all five operations can
be executed in optimal time.


3.2 MERGE SORT 


In the merge sorting methods, smaller sorted sub-sequences 
are merged into larger ones and the merging process is repeated 
until there is only one final sorted sequence left. To merge
sub-sequences, pair-wise comparisons are performed in parallel.
A standard pair-wise comparison is usually defined by two
operators, max and min, as shown in the following [3].


if a > b then { max(a,b) := a; min(a,b) := b }
         else { max(a,b) := b; min(a,b) := a }


In Definition 3, we define a new pair-wise comparison function, 
compare-and-mark, which combines the two marking 
mechanisms with the above comparison function. By using
the compare-and-mark function in a merge-oriented sorting
algorithm, we can prove the result described in Theorem 1 by 
induction [2]. 


Definition 3. Compare-and-mark is a comparison operation 
that manipulates the flag bits as below and also performs the 
standard comparison function. 


if a = b and a_1a_2 = (0,1) then a_1a_2 := (0,0);
if a = b and b_1b_2 = (1,0) then b_1b_2 := (1,1);


Theorem 1. Using the compare-and-mark operation, any merge 
sorting algorithm can sort in the quasi-stable manner and 
perform the marking-plus and marking-minus functions 
correctly. 


The marking processes performed by the compare-and-mark 
function in a merge-oriented sorting algorithm are idempotent; 
they are depicted as a state diagram in Figure 2. Once data 
items reach the state (0,0) or (1,1), they remain in that state. 




Figure 2. State diagram for the marking processes. 
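
A compare-and-mark comparator can be sketched in Python as below. This is an assumption-laden illustration rather than the authors' circuit: items are modelled as hypothetical `(key, flags)` pairs, and marking-minus is assumed to act on the first operand and marking-plus on the second. Applying the comparator twice to the same pair shows the idempotence described above.

```python
def compare_and_mark(a, b):
    """a, b: (key, flags) pairs. Returns (min_item, max_item).

    On equal keys the flag bits are updated first (assumed roles:
    marking-minus on a, marking-plus on b), then the standard
    max/min comparison is applied; a tie leaves a ahead of b.
    """
    (ka, fa), (kb, fb) = a, b
    if ka == kb:
        if fa == (0, 1):
            fa = (0, 0)        # marking-minus
        if fb == (1, 0):
            fb = (1, 1)        # marking-plus
    a, b = (ka, fa), (kb, fb)
    return (b, a) if ka > kb else (a, b)


lo, hi = compare_and_mark((5, (0, 1)), (5, (1, 0)))
print(lo, hi)                      # (5, (0, 0)) (5, (1, 1))
print(compare_and_mark(lo, hi))    # unchanged: the marking is idempotent
```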


Without using sorting, algorithms to perform operations
such as duplicate-removal, union, intersection, and difference
were proposed for systolic arrays [4] and a tree machine [8].
Those algorithms all require linear time O(n+m). Batcher’s
bitonic merge sort [1] has been adapted to several computing
systems with different interconnection patterns. Even with the
mesh interconnection [9], omni-sort promises sublinear time
performance O(√(n+m)) for those operations. If a large
I/O bandwidth is provided, then omni-sort algorithms are
definitely preferred to those in [4,8].


3.3 LINEAR-TIME SORTER 


Here the linear-time sorter refers to a sorter that inputs data
sequentially and produces the sorted sequence while outputting it.
The open-form scheme can be applied to adapt the linear-time
sorter to the omni-sorter. First, data items are properly tagged
and fed into the linear-time sorter. Then the computation in
the marking and the completion stages is performed while
outputting the sequence. This can be easily done by adding
a special post-processor to the sorter.


4. OTHER APPLICATIONS


In addition to performing multiple functions,
the omni-sort has other useful applications. This section briefly
describes two important applications in relational
database processing. One is to pre-condition data items for
carrying out the equi-join operation efficiently. The other is to
process a class of relational queries involving only the five
operations in a novel way. Interested readers may refer to [2]
for more details.


Sorting has been very useful in improving the performance
of the join operations (especially the equi-join operation). A
common approach is to presort database relations A and B on
attributes T if they are to be joined on the attributes T. Then
a join algorithm can take advantage of the well ordered
sequences in A and B. Since omni-sort has the useful
property of being quasi-stable, the relations A and B can instead be
sorted together on the attributes T. All the database tuples that
should be joined are therefore grouped together as aggregates.
Moreover, all the A tuples in an aggregate precede the B tuples
to be joined with them. A simple process to shift and join tuples
is then enough to implement the equi-join operation.
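
The grouping idea can be illustrated with a small sequential Python sketch. It does not model the parallel omni-sorter; it simply imitates the quasi-stable order with an ordinary sort on a (key, source) pair, so that within each aggregate the A tuples come before the B tuples. The function name `equi_join` and the tuple layout are hypothetical.

```python
from itertools import groupby


def equi_join(A, B, key):
    """Join tuples of A and B whose key(t) values are equal."""
    # Tag each tuple with its source; 0 sorts A ahead of B on equal keys,
    # imitating the quasi-stable order described above.
    tagged = [(key(t), 0, t) for t in A] + [(key(t), 1, t) for t in B]
    tagged.sort(key=lambda e: (e[0], e[1]))
    result = []
    for _, group in groupby(tagged, key=lambda e: e[0]):
        a_part = []                      # A tuples seen so far in this aggregate
        for _, src, t in group:
            if src == 0:
                a_part.append(t)
            else:                        # a B tuple: pair it with every A tuple
                result.extend((a, t) for a in a_part)
    return result


A = [(1, 'a'), (2, 'b'), (2, 'c')]
B = [(2, 'x'), (3, 'y'), (1, 'z')]
print(equi_join(A, B, key=lambda t: t[0]))
# [((1, 'a'), (1, 'z')), ((2, 'b'), (2, 'x')), ((2, 'c'), (2, 'x'))]
```

Because every aggregate lists its A tuples first, a single left-to-right pass is enough to pair them with the B tuples that follow.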


We demonstrate the application of the omni-sort to query
processing by an example. Suppose that a database query
involves four relations A, B, C, and D, and requires the
processing of (Π(A) ∪ B) - (Π(C) ∩ Π(D)), where Π is
the projection operation with the requirement to remove
duplicates. The processing of this query can be implemented
by a merge-oriented omni-sort as follows:


(1) R1 = omni-sort(f1, π(A), B) and
    R2 = omni-sort(f2, π(C), π(D));
(2) R3 = omni-merge(f3, R1, R2).


In the first step two runs of omni-sort on different operands can
be executed in parallel, where π is the projection operation
without the requirement to remove duplicates. The control
argument f1 should be set to select X-tag and the union
operation, and f2 to select Y-tag and the intersection operation
(Section 2.3). Then in the second step a merge stage is
sufficient to perform the difference operation since the
intermediate results from the first step are already ordered
sub-sequences and appropriately tagged.


5. SUMMARY 


Omni-sort is a new operation defined to selectively perform
the five important database operations: union, intersection,
difference, duplicate-removal, and sorting. Adaptation schemes
are presented to modify sorting algorithms to implement the
omni-sort operation. We show a general adaptation scheme
that is independent of the sorting algorithm and discuss its
implementation. We also present specific schemes for two
families of sorting methods and a type of linear-time sorter.


A highly parallel sorting algorithm may be proposed to be 
implemented as a specialized processor or to be programmed in
a general-purpose computing system. In the former case, the
adaptation schemes presented in this paper can be applied to 
modify the sorting circuit to implement the omni-sort. In the 
latter case, the parallel program for sorting can be extended to 
perform the omni-sort in accordance with the adaptation 
schemes. The benefit is to offset the initial expenses of design, 
development, and testing of a sorting system by increasing its 
usage. 


Sorting is one of the most important data processing 
operations. Many sorting algorithms have been invented and 
developed. With the proposed adaptation schemes, extending a sorting
algorithm to implement the omni-sort costs little.
Moreover, the performance overhead is
negligible; an omni-sort algorithm should share the same time 
complexity as its underlying sorting algorithm. Based on some 
parallel sorting algorithms, omni-sort also provides faster-than- 
ever algorithms for performing the other four data processing 
operations.


Abstract -- The task of correlating a large number of radar reports
with a large number of stored radar tracks has long been identified as a 
natural application of Single Instruction Multiple Data (SIMD) computers. 
As practical/deployable parallel processors are often limited in power 
and number of processing elements, searching techniques have been 
sought that would provide faster execution times than those obtained by
brute force searches in an SIMD machine. Specifically we sought some 
means of utilizing the power of the parallel processor in the creation 
and searching of a data structure that would give us a species of 
associativity such as enjoyed by a machine with a far larger number of 
processing elements (PEs). 

A data structure and set of searching algorithms have been developed 
which we name Pseudo Associative Linking (PAL). With this technique 
one can, in effect, multiply the power of the SIMD machine on itself. The 
developed algorithms have been successfully used in the implementation of 
the Litton Advanced Tracking System (ATS) developed for Rome Air 
Development Center which uses the 16 element Tracking Array Processor 
(TAP) as the parallel processor. The Litton ATS has already demonstrated 
a track processing capability of over 10,000 tracks. 


1.0 Background 

A need has arisen for radar trackers which will not only track 
hundreds and even thousands of radar targets but perform this task 
without producing an untenable number of false alarms. Such a system 
has never truly been implemented as a standalone system. 

The Tracking Array Processor (TAP) [1] was originally designed
with the object of providing a physically small computer which would
be able to perform tracking in the ‘track while scan’ sense on up to 2000
radar tracks. It soon became apparent that, despite possessing a parallel
processor, achieving these speeds would be possible only if we used
some searching technique analogous to the doubly linked lists employed in
sequential computers and sorted on both azimuth sectors and cartesian sort
boxes. Such techniques can be implemented in SIMD machines and can
be effective, but they are susceptible to soft memory
failures and they produce messy, complex programs. This latter flaw is not
trivial. We thus early on began seeking an alternative to the well known
sequential computer searching techniques.

The answer that we found was the Pseudo Associative Link (PAL) 
memory system. The name derived from the clearly ‘pseudo associative’ 
nature of any searching algorithm that requires programming coupled with 
the notion of linked lists. The technique to be described has been 
implemented in a number of different fashions within the operational 
Advanced Tracking System (ATS) built by Litton and tested at Rome 
Air Development Center. The PAL technique has in fact allowed us to not 
only reach the 2000 track number but to implement a system capable of 
dealing with well over 10,000 tracks! 


2.0 The Tracking Array Processor (TAP) 

The Tracking Array Processor is a classic SIMD machine designed
expressly for the radar tracking problem. During its inception there was no
interest in applying this machine to general purpose problems. Rather, its
design followed studies which demonstrated that an array of processors
might provide an optimum method of doing radar tracking. Thus, the TAP
consists of a maximum of 16 PEs, each possessing 64 K of dynamic RAM
memory and each processing 16 bit data words at a speed of from 3 to
5 MIPS.

0190-3918/83/0000/0226$01.00 © 1983 IEEE

Since the TAP’s function is the parallel processing of disparate data
sets, an extremely simple interconnect network (a linear bus) was deemed
acceptable and later proved to be so. Programming is performed in a high
level, English-like assembly language [2]. The language and the
architecture allow, at each step, for selective enabling of any combination of
the PEs and the execution of conditional instructions based on polled
responses from the enabled PEs. A WAIT statement allows any given PE
to bypass batches of code based on locally tested conditions. Included in
the instruction set are top-down programming constructs such as
REPEAT-UNTIL, WHILE, and IF THEN ELSE statements.


3.0 A Type I PAL

Consider a system which consists of a 2-dimensional array of 
memory elements, N rows by m columns. An SIMD processor of m PEs 
with each PE containing N addresses of memory forms such a 
2-dimensional memory (See Figure 1). When characterizing the SIMD 
machine as 2-D memory, the arithmetic and control logic of each PE
functions only to feed and care for the memory elements in a given 
column. 


[Figure 1 diagram: each PE comprises PE logic (16 registers, flags, ALU)
above a column of PE memory, 16 K minimum.]


Figure 1. Tracking Array Processor 2-D Memory Structure 


A section of the parallel processor memory will be designated a PAL 
memory. Nominally this section of memory will consist of N1 rows, each 
one memory cell deep by m columns wide where m is the number of PEs. 
The PAL memory will be used to point at rows in parallel processor 
memory defined as the object memory (See Figure 2). This memory is 
used to store the data being sought. Thus each row in this memory is 
N2 rows deep. If N3 data items are to be maintained in this memory then 
the number of actual rows in memory is (N2) (N3) /m. The PAL memory 
is required however to point at only N3/m rows.

Figure 2. PAL Memory to Object Memory Link Structure

3.1 Description of the Type I PAL

We now make the following definitions: 

1. We here assume that association will be performed for one
   field of the data only.

2. All data to be stored in a row of the object memory will be
   loosely associated (e.g. if the data to be associated is, say,
   radar azimuth, then all data stored in the same row might be
   contained within a 10 degree azimuth wedge).

Upon initializing the system, each i, j memory location of the PAL 
memory is placed into permanent 1:1 correspondence with a row of the 
object memory. This is performed by placing the actual address of the row 
of the object memory into the PAL memory’s location. Additional data 
stored in this memory location are (See Figure 3): 

a. A bit which indicates if this PAL location ( = row in object
   memory) is in use.

b. A bit which indicates if the row in object memory is full.

c. A field which defines the value of the association data being
   stored in the row in the object memory.

   (example: if the data is radar tracks and we are storing them in
   5.625 degree azimuth sectors then the value of the association
   data will be a 6 bit word defining one of the 64 azimuth
   sectors.)


[Figure 3 diagram: the PAL memory word holds the azimuth sector name,
the permanent row index, a full flag (1 = associated row is full,
0 = associated row is not full), and an in-use flag (1 = associated row
has been assigned to the azimuth sector name, 0 = associated row is
unassigned).]


Figure 3. PAL Memory Word Format

Following system initialization the object memory is empty, and this
is indicated by the flag bits in the PAL memory. When the first
datum arrives for storage, the program first looks in the PAL memory to
see if there is any data with the same ‘name’ (i.e. association field). This
look is done by reading each row of the PAL memory in parallel and
looking for a data match. If an ‘unused’ bit is found set during this
search, the search can stop. Thus if the memory is empty, the search stops
immediately, as inspection of the first row of the PAL memory indicates
no match on the association field value and at least one empty row flag
set.

If there is no association field match, the next available i, j memory
location in the PAL memory is assigned. This is carried out by

a. setting the ‘in use’ flag in the PAL i, j location, and
b. inserting the association value into the PAL location’s
association field.

The pointer into the object memory is now extracted from the PAL
memory location, and the indicated row in the object memory is examined
to determine where the next empty location exists (there must be at least
one). The new data is now written into the next empty location in the
indicated row in the object memory. If this fills the row, the
‘full’ bit in the PAL flag word is set true.

In this fashion each piece of data which lies in the same azimuth
sector will be stored in rows in the object memory such that all data items
in a row are loosely associated. The number of rows assigned to any
particular azimuth will of course be a random function of the input data.
The association variable should be chosen such that
the likelihood of input data being concentrated on one or just a few values
is small. In other words, it must be picked with the
expectation that the data has a high probability of spreading
over a number of values.

The principal virtue of the PAL technique comes not from the ease 
of finding data but in the speed of rejecting data! Consider the act of 
finding a particular track lying within azimuth sector S. We first look at 
row 1 of the PAL memory and ask if any of the locations are pointing to a 
row in the object memory containing azimuth sector S data. If the answer 
is no, we look at row 2 of the PAL and ask the question again. If we get 
the answer that there is at least one, we may check the first candidate row 
that comes to view. 

Checking the candidates means going into the object memory at the
indicated row and checking m candidates simultaneously. If the sought-for
datum is not found, we go back to the row of the PAL memory and see if
the row we are in points to any more candidate rows in the object memory.
If so, we check them. The important points are:

a. If a row of the PAL memory does not point to any candidate rows in
the object memory, then we have in fact checked and discarded m x m
memory locations in the object memory!

b. If a row contains just one pointer to a candidate row in the
object memory then we have excluded m x (m-1) memory locations and
now go on to perform a detailed check of the single candidate row at the
normal speed-up factor of m (the number of PEs).

Thus if we can guarantee that the association parameter produces values
which spread over some range, we can search the parallel processor at a
speed somewhat less than m squared times that of an equivalent sequential
computer and considerably greater than m times that of the equivalent
sequential computer.
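
The store and search procedures of this section can be modelled with a short sequential Python sketch. It is not the TAP assembly code: the class and field names (`TypeIPal`, `PalEntry`, `obj_row`, and so on) are hypothetical, and each inner loop stands in for what the m PEs would do in a single parallel step. The counters returned by `find` show how few rows are touched when data is rejected.

```python
from dataclasses import dataclass


@dataclass
class PalEntry:
    obj_row: int          # permanent index of the linked object-memory row
    in_use: bool = False
    full: bool = False
    assoc: int = -1       # association value, e.g. azimuth sector number


class TypeIPal:
    def __init__(self, n_pal_rows, m):
        self.m = m
        self.pal = [[PalEntry(r * m + c) for c in range(m)]
                    for r in range(n_pal_rows)]
        self.obj = [[None] * m for _ in range(n_pal_rows * m)]

    def _store(self, e, datum):
        slots = self.obj[e.obj_row]
        slots[slots.index(None)] = datum       # next empty location in the row
        e.full = None not in slots
        return e.obj_row

    def insert(self, assoc, datum):
        spare = None
        for row in self.pal:
            for e in row:
                if e.in_use and e.assoc == assoc and not e.full:
                    return self._store(e, datum)
                if not e.in_use and spare is None:
                    spare = e
            if spare is not None:
                break                          # an unused entry ends the scan
        if spare is None:
            raise MemoryError("PAL is full")
        spare.in_use, spare.assoc = True, assoc
        return self._store(spare, datum)

    def find(self, assoc, datum):
        """Returns (found, PAL rows read, object rows checked)."""
        pal_rows = obj_rows = 0
        for row in self.pal:
            pal_rows += 1
            for e in row:
                if e.in_use and e.assoc == assoc:
                    obj_rows += 1              # m candidates checked at once
                    if datum in self.obj[e.obj_row]:
                        return True, pal_rows, obj_rows
            if any(not e.in_use for e in row):
                break                          # unused flag: nothing further
        return False, pal_rows, obj_rows


pal = TypeIPal(n_pal_rows=4, m=4)
for k in range(6):
    pal.insert(3, f"t{k}")                     # six tracks in azimuth sector 3
pal.insert(7, "u0")                            # one track in sector 7
print(pal.find(3, "t5"))   # (True, 1, 2)
print(pal.find(9, "zz"))   # (False, 1, 0)
```

In the second query a single PAL row read rejects every stored item without touching the object memory at all.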


3.2 Basic PAL Operation 


Each time data is to be accessed from a PAL governed memory one 
goes into the PAL itself and reads each row of the PAL until either the end 
of the PAL is reached or a PAL location is detected which contains an 
unused object row flag. 

When data is to be stored into a PAL governed memory one must
read every row of the PAL memory until a location is found which points
to a row in object memory which is not full and possesses the same
association value. If no such row is found, one must be created.

If the data stored in the object memory has an association value that
changes due to periodic access and recalculation, then upon each
recalculation this data may have to be moved within the PAL memory.


3.3 Garbage Collection in a PAL System 

In order to use a parallel processor effectively it is necessary that 
virtually all PEs are utilized a very high percentage of the time. Thus, 
during searches we wish to guarantee that every PE, at every step, contains 
data that is maximally likely to be associated with neighboring PE’s data. 
A similar consideration applies to the act of computing on data in object 
memory. 

The PAL memory structure has as its purpose the job of
guaranteeing that data will be optimally stored in memory so as to allow
parallel access. If data were only loaded into memory, the PAL system so
far described would suffice. In the real world, however, data must be
randomly deleted from memory. If data is deleted from a row in the
object memory that was formerly full, then the system, following the
deletion, is left with two rows in the object memory containing data with
the same association value which are not full. Now clearly if there are m
PEs and the data is effectively random over the association value then
there is a probability of (m-1)/m that, on loading the PAL/object memory,
one row of object memory for any given association value will not be
full. This is unavoidable. If there is a great deal of data we are simply stuck
with one of many rows which is not full and hence processor effectiveness
is only slightly impaired.

If however, there is more than one row not full per each association 
value our efficiency could fall off dramatically; in fact, we have defeated 
the purpose of the PAL system. As data continues to be randomly deleted 
from the system we could reach a state where a great many rows contained 
but a single datum all with the same association value. 

All of this merely goes to say that a form of garbage collection must 
be implemented. In this case we perform the following algorithm on any 
data deletion: 


3.4 Garbage Collection Algorithm 

1. Upon finding and deleting the data, test this row in the object
memory to determine if it was already less than full. If so, test if this
deletion empties the row. If emptied, return to the PAL and reset the ‘in
use’ bit and exit.

2. If the object memory row was full before the deletion then
search the PAL memory to find if there exists a row in object memory
with the same association value which is less than full.

Remove a datum from the row which is already less than full and
move this datum so as to fill the hole just made by the deleted datum.
Then check if this removal empties the row from which the datum was taken;
if so, clear the association value from the linked PAL location and
set the full/empty flag to empty.
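
The two deletion cases can be sketched in Python as follows. The representation is hypothetical: each PAL location is a dict with `in_use`, `assoc` and a `row` list standing in for the linked object-memory row, and the donor row in step 2 is assumed to carry the same association value as the row being refilled.

```python
def delete_with_compaction(entries, m, assoc, datum):
    """Delete datum and keep at most one partly filled row per assoc value."""
    holder = next(e for e in entries
                  if e['in_use'] and e['assoc'] == assoc and datum in e['row'])
    was_full = len(holder['row']) == m
    holder['row'].remove(datum)
    if not was_full:
        # Step 1: the row was already less than full; release it if now empty.
        if not holder['row']:
            holder['in_use'], holder['assoc'] = False, None
        return
    # Step 2: the row was full; refill the hole from a partly filled row
    # with the same association value, if one exists.
    donor = next((e for e in entries
                  if e is not holder and e['in_use'] and e['assoc'] == assoc
                  and 0 < len(e['row']) < m), None)
    if donor is None:
        return
    holder['row'].append(donor['row'].pop())
    if not donor['row']:                      # donor emptied: release it
        donor['in_use'], donor['assoc'] = False, None


entries = [
    {'in_use': True, 'assoc': 3, 'row': ['a', 'b', 'c', 'd']},
    {'in_use': True, 'assoc': 3, 'row': ['e']},
    {'in_use': False, 'assoc': None, 'row': []},
]
delete_with_compaction(entries, 4, 3, 'b')
print(entries[0]['row'], entries[1]['in_use'])   # ['a', 'c', 'd', 'e'] False
```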


3.5 Computation Times For A Type I PAL

We shall continue to draw our examples from the radar case which
after all was the motivation behind this work. Let us consider the case where
a type I PAL is storing radar data and is accessed by azimuth wedges. Then


Trt = (Nr) [Nt/(Aw/Ac)] (Os) (1/m) (T1) (1) 
where, 

Trt = the plot to track correlation time 

Nr = the no. of reports 

Nt = the no. of tracks 

Aw = the azimuth wedge containing all data. 

Ac = the azimuth cell size 

m = the number of PEs 

Os = overlap scalar 

T1 = the time to correlate one report against one track 


Obviously computation time is a function of how the targets tend to 
cluster in azimuth cells. (NOTE: As targets may lie on the edge of azimuth 
sectors it is often necessary to search more than one sector. The overlap 
scalar (Os) indicates the number of sectors that must be searched.) 

To illustrate the timing we shall assume Nr = Nt = 2000, Ac = 3 
degrees, Os = 2.25, and T1 = 10 microseconds. Figure 4 was generated by 
assuming that all of the tracks were distributed randomly in a fixed
azimuth wedge as indicated on the abscissa. Thus if the data is randomly
distributed over the entire 360 degrees we have the point(s) on the far
right of the figure. If all the tracks are contained within a 120 degree
wedge, we get the indicated point.

Figure 4. Report to Track Correlation Times for a Type I PAL Organized
with Report/Track Azimuth as the Association Field. (m = the no. of PEs)
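
As a worked illustration of equation (1), the following Python sketch evaluates the correlation time for the stated example values. The choice m = 16 is an assumption taken from the TAP configuration, and the function name is hypothetical.

```python
def type1_correlation_time(nr, nt, aw_deg, ac_deg, os_, m, t1):
    """Equation (1): Trt = Nr * [Nt / (Aw/Ac)] * Os * (1/m) * T1."""
    cells = aw_deg / ac_deg                 # number of azimuth cells in the wedge
    return nr * (nt / cells) * os_ * (1.0 / m) * t1


# Nr = Nt = 2000, Ac = 3 degrees, Os = 2.25, T1 = 10 microseconds, m = 16
for wedge in (360, 120):
    t = type1_correlation_time(2000, 2000, wedge, 3, 2.25, 16, 10e-6)
    print(f"Aw = {wedge:3d} deg: {t:.4f} s")
```

Under these assumptions the time is about 0.047 s for a full 360 degree spread and about 0.141 s when all tracks fall in a 120 degree wedge, since the same tracks are then crowded into a third as many azimuth cells.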


4.0 A Type II PAL

Suppose we have quite a large number of tracks and try to perform
our searching with a type I PAL. Certainly there is a point, depending
upon our system constraints, at which we will run into problems. Assume,
for example, that we must process 20,000 trial tracks and there are 3000
radar reports which must be correlated with these tracks. As a basis for
comparison let us take the time to search the entire track memory for
correlation if there exists no matching association field. If we use a type I
PAL, then:


Tnc1 = (Nr) [Nt/m^2] (tc)                                    (2)


where:
Tnc1 = the time to search a type I PAL and get
NO correlations.
Nr = the number of reports
Nt = the number of tracks
m = the number of PEs
tc = the time to check and reject one
row of the PAL
If Nr = 3000 reports, Nt = 20000 tracks, m = 16 and tc = 3
microseconds then,

Tnc1 = (3000) [20000/256] (.000003) = 0.7 seconds            (3)
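
The arithmetic of this no-correlation search time can be checked with a few lines of Python; the function name is hypothetical and the values are those stated above.

```python
def type1_no_match_time(nr, nt, m, tc):
    """Each of nr reports scans nt / m**2 PAL rows at tc seconds per row."""
    return nr * (nt / m**2) * tc


t = type1_no_match_time(nr=3000, nt=20000, m=16, tc=3e-6)
print(f"{t:.3f} s")   # 0.703 s, i.e. roughly 0.7 seconds
```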


Can a PAL operate on another PAL system so as to produce another
order of magnitude of improvement? The answer is yes and this system we
arbitrarily refer to as a type II PAL. Besides being of intellectual interest,
the type II PAL became a practical necessity when the number of tracks
that we had to process rose beyond the several thousand level. At this
number of tracks we had but 16 PEs to play with and we found that the
overhead operation of searching for empty rows in the object memory and
the basic search were consuming far too much time.

Consider the system illustrated in Figure 5. Track data is stored in
cartesian boxes in the object memory defined by an association field
which specifies 16 mile by 16 mile squares in the x-y plane. There are two
PAL memories, a level 0 PAL and a level 1 PAL. Each memory location of
the level 0 PAL points to a row in the level 1 PAL. And every memory
location of the level 1 PAL points to a row in the object memory.

Figure 5. A Type II PAL Structure


As with the type I PAL each row in the object memory contains
data which is loosely associated by the association field. Each such row is
pointed to by a memory location in the level 1 PAL. By our definition,
each row of the level 1 PAL is allowed to point only to rows in the object
memory containing the same sort box (i.e. 16 mi. x 16 mi. area). This
defines a maximum of m x m tracks that may be contained in any given
sort box.

Each row of the level 0 PAL is addressed by a truncated version of
the sort box address. Thus if there are 16 PEs, each row of the level 0 PAL
points to a 4096 sq. mi. region.

In summary, we provide in excess of 20,000 memory locations in
the object memory arranged as 16 columns (PEs) by at least 1250 rows.
Each row in the object memory is pointed to by a location in the level 1
PAL. Each row in the level 1 PAL is dedicated to a single 16 mi. x 16 mi.
sort box in the correlation space.

Each location of the level 0 PAL points to a single row of the level 1
PAL and consequently is pointing to 256 tracks. If there are 20,000
memory locations then there are at least 20,000/m rows in the object
memory. If m, the number of PEs, is equal to 16, then the number of rows
is 1250. Assume the maximum range is 256 miles. Thus as each row of
the level 1 PAL points to 16 rows of the object memory there must be
1250/16 = 78.125 rows, implying 79 rows, in the level 1 PAL. As each row
of the level 0 PAL points to 16 16 mi. x 16 mi. sort boxes there must be
79/16 = 4.94 rows, implying 5 rows, in the level 0 PAL.
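
The row counts follow from repeated division by m, rounding up at each level. A minimal Python check (the function name `pal_rows` is hypothetical):

```python
import math


def pal_rows(capacity, m, levels):
    """Rows in the object memory followed by rows in each PAL level above it."""
    rows = math.ceil(capacity / m)      # object-memory rows
    sizes = [rows]
    for _ in range(levels):
        rows = math.ceil(rows / m)      # one PAL location per row below
        sizes.append(rows)
    return sizes


print(pal_rows(20000, 16, 2))   # [1250, 79, 5]
```

The list gives the object memory, the level 1 PAL and the level 0 PAL in that order.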

Let us consider now the same timing problem that brought us to
this solution. The timing formula to search the entire track file is now


Tnc2 = (Nr) [Nt/m^3] (tc)                                    (4)

If Nr = 3000, Nt = 20,000, m = 16 and tc = 3 x 10^-6 sec.,
then,

Tnc2 = (3000) [20000/4096] (.000003) = 0.044 seconds         (5)


Let us now directly compare the PAL I and the PAL II systems
using the task of correlating radar reports with stored tracks. For
purposes of analysis, we shall assume that all reports and tracks are
constrained to lie within a defined azimuth wedge whose size we can vary.
The targets/tracks shall be assumed to be uniformly and randomly
distributed throughout the entire defined wedge. To facilitate our
analysis we have calculated the area contained in each such wedge and
divided it by the area of the cartesian sort box to obtain the number of
sort boxes containing the track/report data. To further simplify matters
the number of tracks per sort box is assumed to be the total number of
tracks divided by the number of sort boxes. This calculation is expressed
by the formula:


Ntsb = Nt / [(pi) (R^2) (Aw/360 deg) / (sort box area)]      (6)

where,
Ntsb = the no. of tracks per sort box
R = the maximum range of the radar
Aw = the angular width of the azimuth wedge in degrees
Nt = the total number of tracks currently active in the
system


Any formula for the computation time expended in correlating radar
reports with radar tracks must address two principal tasks:

a) Every report must scan the entire PAL system to find any
possible links. In the PAL I system, given that there are a maximum of
20,000 tracks, each report must look at 20,000/(m^2) rows in the PAL
memory. If m = 16, then 79 rows in the PAL memory must be examined
by every report.

b) Every report that finds a link into the track memory must be
correlated against at least one row of m tracks, with m = the number of
PEs.

The correlation time for the PAL I system is expressed by:

Tr1 = Nr [(Ntsb/m) (Os) (T1) + (Nmx/m^2) (T2)]               (7)


For the PAL II system,


Tr2 = Nr [(Ntsb/m) (Os) (T1) + (Nmx/m^3) (T2)]               (8)

where:
Nr = the no. of reports
Ntsb = the no. of tracks per sort box
T1 = the time to correlate one report against one row
of tracks
T2 = the time to check one row of a PAL memory
Nmx = the maximum no. of tracks the system can hold
m = the no. of PEs
Os = the overlap scalar


Figures 6a and 6b plot the correlation time for both of these PALs
given Nr = 3000 reports, Nt = 10,000 tracks, Nmx = 20,000 tracks, Os =
2.25, T1 = 20 microseconds and T2 = 4 microseconds. Figure 7 provides a
direct comparison of the results for the two PAL systems with the number
of PEs set to 16.

Figure 6A. Type I PAL Results

Figure 6B. Type II PAL Results


Figure 6. Plots of Report to Track Correlation Time for Type I
and Type II PALs Using Target x, y as the Association Variables.
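
A single point of such a comparison can be reproduced with the Python sketch below. Several inputs are assumptions rather than values taken from the plots: the wedge area is taken as pi R^2 (Aw/360), the maximum range as 256 miles, the sort box as 16 mi. x 16 mi., and the wedge as the full 360 degrees. The function names are hypothetical.

```python
import math


def tracks_per_sort_box(nt, r, aw_deg, box_area):
    """Total tracks divided by the number of sort boxes in the wedge."""
    boxes = math.pi * r**2 * (aw_deg / 360.0) / box_area
    return nt / boxes


def correlation_time(nr, ntsb, m, os_, t1, t2, nmx, levels):
    """levels = 1 gives the PAL I form (m**2), levels = 2 the PAL II form (m**3)."""
    return nr * ((ntsb / m) * os_ * t1 + (nmx / m**(levels + 1)) * t2)


ntsb = tracks_per_sort_box(nt=10000, r=256, aw_deg=360, box_area=16 * 16)
common = dict(nr=3000, ntsb=ntsb, m=16, os_=2.25, t1=20e-6, t2=4e-6, nmx=20000)
tr1 = correlation_time(levels=1, **common)
tr2 = correlation_time(levels=2, **common)
print(f"Ntsb = {ntsb:.2f}, PAL I: {tr1:.3f} s, PAL II: {tr2:.3f} s")
```

With these inputs the first term (detailed correlation) is identical for both systems; the whole difference comes from the PAL scanning term, which shrinks by a factor of m in the two-level structure.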


5.0 A Type III PAL

The data structure described in the previous section is clearly faster
than the Type I PAL, yet in the tracking application it contains a not
insignificant defect. By only allowing 79 rows (in our example) in the
level 1 PAL memory we in fact are allowing only 79 sort boxes. There are,
in fact, nearly 1000 possible sort boxes. Once our allowed 79 have been
assigned, any new track falling outside of this set is not accommodated in
the just stated system. One could have a very sparse track load yet not be
tracking all the aircraft.

The solution is our third and last PAL system, the Type III PAL 
structure. In this system we provide a PAL structure wherein there exists 
a row in the PAL 1 memory (here called the PAL S1 memory) for every 
possible sort box. Each row in this PAL S1 memory may now be addressed 
directly by the data field, as the data field is the address of a row. The 
individual locations within the PAL S1 memory row are not linked 
permanently to rows in the object memory as was the case in the Type I 
and Type II systems. Instead, such a link is made if and only if a datum 
occurs within the indicated sort box. This means that upon each new 
track entering the system, we must search the rows of the object memory, 
find the first unassigned row and assign it to the 
appropriate location in the PAL S1 memory. Since there are at least 1200 
rows in the object memory, this search itself becomes the major time sink. 
Thus we complete the system by incorporating a Type II PAL structure 
whose sole purpose is to point to unused rows in the 
object memory. Figure 7 illustrates the complete Type III PAL system. 
Upon system initialization, the included Type II PAL structure (PAL U0 
and PAL U1) is permanently linked to the object memory such that 
each location of PAL U1 points to a single row of the object memory and 
each location of PAL U0 points to a single row of the PAL U1 memory. 
Figure 7. A Type III PAL Structure. The Sort Box PAL and PAL 0 are 
Used to Correlate a Report Against a Track. The PAL Structure is 
Used to Locate Empty Rows in the Object Memory. 


When searching for unused object memory rows the algorithm first looks 
in the PAL U0 memory. If there are any unused rows in the first 256 rows 
of the object memory, at least one location in this first row will have a bit 
set indicating the fact. The algorithm then draws out the first indicated 
word from the PAL U1 memory, and at least one of its locations (PEs) 
will point to an unused row in the object memory. Upon using the row, 
the algorithm resets the 'unused row' bit in the PAL U1 location, 
indicating that the linked row is now in use. If this was the last location of 
that row in the PAL U1 which had been empty, the appropriate flag bit 
must be reset in the PAL U0 row/column location. 
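
The free-row search just described can be sketched in Python as follows. This is a minimal illustration rather than the TAP implementation: the class name FreeRowIndex, its methods, and the use of Boolean lists in place of per-PE flag bits are assumptions made for this sketch, and the release method is an assumed counterpart that the text does not describe. Only the two-level U0/U1 arrangement and m = 16 come from the text.

```python
class FreeRowIndex:
    """Two-level free-row index modelled on the PAL U0 / PAL U1 description."""

    def __init__(self, n_rows, m=16):
        self.m = m
        n_u1 = -(-n_rows // m)   # U1 rows, each covering m object rows
        n_u0 = -(-n_u1 // m)     # U0 rows, each covering m U1 rows
        # u1[r][c] is True while object row r*m + c is unused.
        self.u1 = [[r * m + c < n_rows for c in range(m)] for r in range(n_u1)]
        # u0[r][c] is True while U1 row r*m + c still has an unused location.
        self.u0 = [[r * m + c < n_u1 for c in range(m)] for r in range(n_u0)]

    def allocate(self):
        """Return the first unused object row and mark it used, or None."""
        for r0, flags in enumerate(self.u0):
            if not any(flags):               # one test rules out m*m object rows
                continue
            c0 = flags.index(True)
            u1_row = r0 * self.m + c0
            c1 = self.u1[u1_row].index(True)
            self.u1[u1_row][c1] = False      # reset the 'unused row' bit
            if not any(self.u1[u1_row]):
                flags[c0] = False            # last free location in that U1 row
            return u1_row * self.m + c1
        return None

    def release(self, row):
        """Assumed inverse operation: mark an object row unused again."""
        u1_row, c1 = divmod(row, self.m)
        r0, c0 = divmod(u1_row, self.m)
        self.u1[u1_row][c1] = True
        self.u0[r0][c0] = True


index = FreeRowIndex(n_rows=1200, m=16)
first_three = [index.allocate() for _ in range(3)]
```

With m = 16, each row of U0 summarizes 256 object rows, which matches the "first 256 rows" mentioned above.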

The actual report-to-track correlation time is now very rapid as the 
algorithm is able to address the PAL S1 memory directly which in turn is 
pointing directly to all rows in the object memory containing tracks in the 
referenced sort box. 


The correlation time for the Type III PAL system is 


Tr3 = Nr (Ntsb/m) (Os) (T1) (9) 
Using the UO and U1 structure to find an unused row in the object 
memory occurs only upon a new track being loaded into the system or 
when an existing track moves into a previously unused sort box. The 
computation time to find an unused row in the object memory is 


Tns = (Nmx/m^3) (T2)        (10) 


For Nmx = 20000, m = 16 and T2 = 4 microseconds, we have Tns = 
19.5 microseconds. 
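
The timing expressions can be checked numerically with a short Python sketch. The function names are hypothetical. The form Tns = (Nmx/m^3) T2 is inferred here from the quoted result of 19.5 microseconds, which it reproduces for Nmx = 20000, m = 16 and T2 = 4 microseconds. No value of Ntsb is quoted in the text, so it is left as a parameter.

```python
def type2_correlation_time(nr, ntsb, m, os_, t1, t2, nmx):
    """Equation (8): Tr2 = Nr [(Ntsb/m)(Os)(T1) + (Nmx/m^3) T2]."""
    return nr * ((ntsb / m) * os_ * t1 + (nmx / m**3) * t2)


def type3_correlation_time(nr, ntsb, m, os_, t1):
    """Equation (9): Tr3 = Nr (Ntsb/m)(Os)(T1)."""
    return nr * (ntsb / m) * os_ * t1


def free_row_search_time(nmx, m, t2):
    """Equation (10), as inferred: Tns = (Nmx/m^3) T2."""
    return (nmx / m**3) * t2


# All times are in microseconds.
tns = free_row_search_time(nmx=20000, m=16, t2=4.0)
print(f"Tns = {tns:.1f} microseconds")
```

Under these expressions the Type II and Type III correlation times differ by exactly Nr x Tns for the same inputs, since equation (9) is equation (8) without its second term.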

Figure 8 indicates the performance of the Type III PAL and Figure 9 
contrasts the performance of the three PAL types. Both of these plots 
were generated using the same input conditions as the plots of Figure 6a 
and 6b. 
Figure 8. Plots of report to track correlation time for a Type III PAL 
using target x,y as the association variable. 

Figure 9. Performance contrasts for the three types of 
PAL structures described. 


6.0 Summary 

A natural application of an SIMD machine is in parallel searches in 
algorithmic operations such as correlation, where a large amount of input 
data must be associated with a mass of stored data where there is ideally 
only one match for every stored datum. If one is blessed with an enormous 
number of processing elements and memory then there is no problem; one 
may attack the problem in a truly parallel sense. Real world cost and 
finite size limitations usually restrict the number of processing elements to 
far less than a desired number. 

The PAL technique was devised as a practical method by which the 
inherent power of the parallel processor might be multiplied by itself one 
or two times, producing very great effective speedups. This is essentially 
accomplished by trading off memory, which is now cheap, against 
processing elements, which are still not cheap. The PAL system allocates a 
section of two dimensional memory (the PAL memory) wherein each 
element points permanently at a row elsewhere in the 2-D memory (called 
the object memory). The contents of each row in the object memory are 
loosely associated under the association variable. By testing a row in the 
PAL memory and not finding a match, m rows of the object memory may 
be discarded and consequently m x m potential candidates may be 
discarded. The key element of the trick is obviously choosing the 
association variable so that at any time a correlation candidate only finds a 
few rows against which to be correlated. 

The PAL technique has been implemented in a now operational 
radar tracker (The Advanced Tracking System) utilizing a 16 PE parallel 
processor, the TAP. The result has been a machine which is able to 
maintain over 10,000 radar trial tracks and is consequently able to tolerate 
considerably higher input false alarm levels than heretofore possible. The 
resulting tracker is able to maintain over 1000 actual tracks 
in very noisy environments and has substantially altered the expectations 
of tracking systems and changed the notion of their limitations. 


7.0 Acknowledgements 
The first programmed implementation of the PAL technique was 
performed by Mr. Dale Frederick who also helped design the TAP 
programming language. The language and compiler design were completed 
by Mr. Leigh Murphy who, even more significantly, was the first person to 
utilize the PAL Type I and Type II data structures and algorithms in the 
construction of the operational Litton Advanced Tracking System. 


8.0 References and Bibliography 
(1) F.P. Hiner IIf, F. Erickson, D. Frederick, D. Johnson, ‘The Tracking 
Array Processor’’, Second Tactical Air Surveillance and Control 
Conference, RADC, Griffiss Air Force Base, New York, 1980. 


(2) F. Erickson. A description of the Building Block Signal Processor 
Assembly Language (BUBAL). Litton internal Report. 


(3) F. Hiner Ill, “The Tracking Array Processor’’, Proceedings of the 
1978 International Conference on Parallel Processing’’. 


(4) F.P. Hiner ttl, ‘The 12000, A Remote Radar Tracking Station,’’ 
International Conference ‘‘Radar-77’’ London, Oct 1977. 


(5) H. Gregory Schmitz and Cheng-chi Huang, “An Efficient 
Implementation of conflict Prediction in a Parallel Processor’, Lecture 
Notes in Computer Science, pg. 383-399, 1974, Springer-Verlag. 


IMPLEMENTATION OF AN ARRAY AND VECTOR PROCESSING LANGUAGE 


R.H. Perrott, D. Crookes, 
P. Milligan and W.R.M. Purdy 


Department of Computer Science 
The Queen’s University of Belfast 
Belfast, BT7 1NN, N. Ireland 


ABSTRACT -- A compiler for a Pascal based 
language Actus is described. The language is 
suitable for the expression of the type of 
parallelism offered by both array and vector 
processors. The implementation described is for 
the Cray-1 computer. An objective of the 
implementation has been to construct an 
optimizing compiler which can be readily adapted 
for a range of array and vector processors. As a 
result the machine dependent sections of the 
compiler have been clearly identified. 


INDEX TERMS -- array processors, vector 
processors, graph transformations, abstract 
representation, optimization. 


INTRODUCTION 


The programming languages which have been 
designed and implemented for array and vector 
processors rely on strategies which can be 
divided into two categories: 

(i) detection of parallelism, in which a 
programmer constructs a problem solution in 
a sequential programming language (usually 
FORTRAN), and a compiler tries to detect 
any inherent parallelism [3, 6]; 

(ii) expression of machine parallelism, in which 
the syntax of the language reflects the 
underlying parallelism of the hardware, 
either directly, or by means of subroutine 
calls [1, 7]. 


A third category has recently been proposed, 
which aims to exploit the parallelism inherent in 
the problem; the program and data structures of 
the language enable a programmer to express 
directly the parallel nature of a problem, 
without reference to specific architectural 
features [4]. 


The implementation of these languages presents the 
compiler writer with problems of varying degrees 
of difficulty. In the detection category, the 
compiler must determine which statements can be 
grouped together and executed in parallel. In 
the second category, the syntax assists the task 
of compiler construction, since it is designed to 
map directly onto the instruction set of the 
machine. The third category, which relies on 
neither a direct hardware 
correspondence nor a detection mechanism, 
requires a new approach for the construction of a 
compiler for these parallel machines. In this 
paper, we consider a compiler which is being 
constructed for the Pascal-based language Actus, 
on the Cray-1 [6]. The aim has been to construct 
an optimizing compiler which can be used as the 
basis for a range of Actus implementations. As 
this aim ultimately conflicts with the task of 
exploitation of the hardware, the 
machine-dependent sections of the compiler must 
be clearly identified. 


LANGUAGE DETAILS 


Actus provides the scalar structured programming 
concepts of Pascal. In addition, the declaration 
part of a block is used to indicate the maximum 
extent of parallel processing which can be 
applied to a structure. For example, in the 
declarations 

var a, b : array [1:100] of real; 
    cc : array [1..3, 10:90] of integer; 


the extent of parallelism (eop) is indicated by 
parallel dots : and is 1:100 for the arrays a and 
b, 10:90 for the array cc. The elements of an 
array such as a may be referenced in parallel by 
means of an expression such as 


a[i:j] 


where i <= j and both variables are in the range 
1..100. A non contiguous set of elements may be 
referenced by establishing (static) index sets 
such as 


index 
odds = 1:[2]99; (* 1,3,5,...,99 *) 
primes = 2:2+3:3+5:5+7:7+11:11+13:13+17:17; 


and using the index set identifiers as follows: 


a[odds] 
b[primes]. 


Other more complex index sets can be established 
during execution of a program by combining index 
sets using the operators + (union), * 
(intersection) and - (difference). 
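
The behaviour of index sets and their operators can be modelled with ordinary Python sets. This is only an illustrative sketch: the helper index_set is a hypothetical name, and Python's |, & and - stand in for the Actus operators +, * and -.

```python
def index_set(first, last, step=1):
    """Model of the Actus segment first:[step]last as a set of indices."""
    return set(range(first, last + 1, step))


odds = index_set(1, 99, 2)                        # 1:[2]99
primes = set().union(*(index_set(p, p)            # 2:2 + 3:3 + ... + 17:17
                       for p in (2, 3, 5, 7, 11, 13, 17)))

odd_primes = odds & primes       # Actus: odds * primes (intersection)
even_primes = primes - odds      # Actus: primes - odds (difference)
either = odds | primes           # Actus: odds + primes (union)
```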


During the course of program execution, the 
extent of parallel processing can be manipulated 
dynamically by the language statements. Actus 
thus provides the concept of a dynamic eop, which 
can be set and adjusted by the following 
statement types: 


   assignment, 
   if, 
   case, 
   while, 
   for, 
   with, 
   within. 


Three examples which are relevant to other 
sections of the paper are: 

(i) if statement 

In a parallel if statement, the eop within the 
then clause refers to the indices of only those 
elements of the parallel condition which were 
true. If there is an else clause, the eop refers 
to the complement set of indices i.e. those 
indices which were false. 


For example: 
if a[2:50] > 0.0 then cc[i,#] := 0 


The anonymous sharp symbol # is used to represent 
the current eop. In this example it is 
determined by the Boolean expression 


a[2:50] > 0.0 


i.e. if a[2] > 0.0 then 2 is in the set #, 
if a[3] > 0.0 then 3 is in the set #, etc. 

(ii) while statement 


During a parallel while statement the eop is a 
non-increasing set which is re-evaluated on each 
iteration until it is empty. For example, 


while a[1:100] > 0.0 do 
begin 
   b[#] := b[#] + a[#] * a[#]; 
   a[#] := a[#] - 1.0 
end 


The statements are applied to only those elements 
of a which are currently in the eop and which are 
positive; this determines the set as represented 
by the # on each iteration. 


(iii) within statement 

The within statement can be used to indicate the 
eop for a series of statements in which the eop 
will not change. It sets the eop as follows: 


within 50:70 do 
begin 
   a[#] := 0.0; 
   cc[2,#] := 1 
end 




If any of these extent setting constructs is 
nested, the eop is stacked as each new construct 
is entered, and unstacked upon exit. In addition, 
the scalar equivalents of these statements 
(except for the within statement) are available. 
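
The way the parallel if and while statements narrow the eop can be modelled in Python as below. This is a sketch of the semantics described above, not of how Actus is compiled: the function names parallel_if and parallel_while are hypothetical, each element-wise statement is modelled as a loop over the current eop, and arrays are modelled as dictionaries indexed from 1.

```python
def parallel_if(eop, condition, then_part, else_part=None):
    """Evaluate the condition once; split the eop into true and false indices."""
    flags = {i: condition(i) for i in eop}
    then_eop = [i for i in eop if flags[i]]
    else_eop = [i for i in eop if not flags[i]]      # complement within the eop
    if then_eop:
        then_part(then_eop)
    if else_part is not None and else_eop:
        else_part(else_eop)


def parallel_while(eop, condition, body):
    """Re-evaluate the condition on each iteration; the eop can only shrink."""
    sharp = [i for i in eop if condition(i)]
    while sharp:
        body(sharp)
        sharp = [i for i in sharp if condition(i)]


# The while example from the text, on assumed test data.
a = {i: float(i % 3) for i in range(1, 101)}
b = {i: 0.0 for i in range(1, 101)}


def body(sharp):
    for i in sharp:                 # b[#] := b[#] + a[#] * a[#]
        b[i] = b[i] + a[i] * a[i]
    for i in sharp:                 # a[#] := a[#] - 1.0
        a[i] = a[i] - 1.0


parallel_while(range(1, 101), lambda i: a[i] > 0.0, body)

# A parallel if with an else clause: the else eop is the complement set.
signs = {}
parallel_if(range(1, 11), lambda i: i % 2 == 1,
            lambda sharp: signs.update({i: "odd" for i in sharp}),
            lambda sharp: signs.update({i: "even" for i in sharp}))
```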


Data alignment 


There are two data alignment operators which can 
be used to move data within a parallel structure 
namely, 


(i) the shift operator which enables the 
alignment of data within the range of the 
declared extent of parallelism. 

For example 


var a, b: array [1:10] of real; 


a[1:3] := b[1:3 shift 1] 
assigns 
a[1] := b[2], a[2] := b[3], a[3] := b[4]. 


(ii) the rotate operator which enables the data 
to be shifted circularly with respect to the 
extent of parallelism. For example 


a[1:3] := b[1:3 rotate 1] 
assigns 
a[1] := b[2], a[2] := b[3], a[3] := b[1]. 


The general form of an index expression using 
these operators is 

eop alignment operator distance 

where eop is either an explicit definition of the 
extent of parallelism or an index set, and 
distance is an integer expression whose value can 
be positive or negative. 
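
The two alignment operators can be illustrated with a small Python model. The function names and the dictionary representation of a 1-based array are assumptions of this sketch. It reproduces the two assignments shown above for an eop of 1:3 and a distance of 1.

```python
def shift(array, eop, distance):
    """Values selected by array[eop shift distance]: plain index offset."""
    return [array[i + distance] for i in eop]


def rotate(array, eop, distance):
    """Values selected by array[eop rotate distance]: circular within the eop."""
    n = len(eop)
    return [array[eop[(k + distance) % n]] for k in range(n)]


b = {i: 10 * i for i in range(1, 11)}    # b[1..10] = 10, 20, ..., 100
eop = [1, 2, 3]

shifted = shift(b, eop, 1)               # b[2], b[3], b[4]
rotated = rotate(b, eop, 1)              # b[2], b[3], b[1]
```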


This brief summary of Actus has described only 
some of the features of the language which are 
relevant to the discussion which follows. 
Further details can be found in [4, 5]. 


STRUCTURE OF THE COMPILER 


The Actus compiler is implemented in Pascal and 
based on a multipass scheme. On the first pass, 
syntactic and semantic analysis is carried out by 
the analyser and a pseudo-analyser constructs an 
abstract representation of the program. On the 
second pass, the pseudo-analyser transforms this 
representation to include both machine-dependent 
and machine-independent optimizations. Although 
the machine-independent transformations will be 
described before the machine-dependent ones, 
this is not necessarily the order of application. 
Indeed, the machine-dependent transformations are 
generally applied first, because they can often 
introduce additional code structures which 
require optimization. Once the graph has been 
transformed, code synthesis follows. Certain 
aspects of the code generation process have been 
rendered machine-independent by the construction 
of an abstract description of the target machine 
(described in a later section). 


Program representation 


The abstract representation of a source program 
is a graph, which is constructed by the 
pseudo-analyser. However, the graph is not 
designed to reflect exactly the constructs of the 
source program as is usually the case for 
graph-based compilers [2,9]; rather, the 
node-types are chosen so as to facilitate the 
transformations to be effected during code 
synthesis. For example, in Actus several 
structures set the extent of parallelism and 
there are several forms of loops; on the graph 
only one node-type may set the extent of 
parallelism and there is a reduced number of loop 
structures. 

(i) Control flow structures 

There are five semantically distinguishable loop 
structures in Actus, namely the scalar and 
parallel for statement, scalar and parallel while 
statement and the scalar repeat statement. One 
of the loop structures on the graph is equivalent 
to the language construct (not actually available 
in Actus) 


repeat 
statement 
while boolean expression 


which is chosen for the consequential simplicity 
in removing loop invariants. To express a while 
statement in terms of this structure requires 
duplication of the terminating condition in an 
enclosing if statement: 


while test do statement 

becomes 

if test then 
   repeat 
      statement 
   while test 


This transformation would be applied when loop 
invariant removal is applicable. Although it 
will increase the size of the object program, it 
does not impair run time speed. Similar remarks 
apply to representation of for statements. 
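
The benefit of the guarded repeat form for loop-invariant removal can be seen in the following Python sketch. Both functions and the invariant k * k are invented for illustration. Python has no repeat ... while construct, so it is modelled by a loop with a test at the bottom.

```python
def sum_scaled_while(xs, k):
    """Original form: the invariant k * k is recomputed on every iteration."""
    total, i = 0, 0
    while i < len(xs):
        scale = k * k
        total += scale * xs[i]
        i += 1
    return total


def sum_scaled_guarded_repeat(xs, k):
    """Transformed form: 'if test then repeat statement while test'."""
    total, i = 0, 0
    if i < len(xs):                  # duplicated terminating condition
        scale = k * k                # invariant hoisted; runs only if the body runs
        while True:                  # repeat
            total += scale * xs[i]
            i += 1
            if not (i < len(xs)):    # while test
                break
    return total
```

Because the hoisted code sits inside the enclosing if, it is never executed when the original loop body would not have been executed.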

(ii) The extent of parallelism 

The pseudo-analyser simplifies references to the 
extent of parallelism on the first pass. The 
following statements illustrate the diversity of 
references to the eop. 
(1) a[1:10] := b[1:10]; 
(2) within 1:10 do a[#] := b[#]; 
(3) if a[1:10] > 0.0 then 
       a[#] := b[#]; 


In (1), the eop is set explicitly and explicitly 
referenced on every use. In (2), it is set 
explicitly but referenced by implication. In 
(3), its production is based on calculation and 
its reference is again by implication. 


Actus index sets are static and therefore are 
unsuitable for representing the dynamic eop on 
the graph. For this purpose the graph uses a 
"run-time-set" construct, which contains 
information on the base value, stepping value, 
length and regularity of a parallel index. A 
regular parallel index is one which has a 
constant increment between all ordered adjacent 
values. 


In addition to the usual set arithmetic 
operations, the pseudo-analyser uses four 
operations in constructing and handling 
run-time-sets. These are 


(1) setfrom (index set expression) - converts 
an index set expression to a run-time-set; 


(2) truevalues (vector expression) - returns a 
run-time-set corresponding to those members 
of the current eop which yield "true" when 
substituted for the parallel index in the 
vector expression; 

(3) anymembers (run-time-set) - returns a 
boolean value determined by whether the 
"run-time-set" is empty or not; 

(4) setcomplement (run-time-set) - complements 
a run-time-set within the current eop. 


The results of "setcomplement" and "truevalues" are always 
subsets of the run-time-set of the current eop. 
Set calculations are performed by "setassign" 
structures on the graph. 


Given this facility for calculating and storing 
the extent of parallelism, we require a unique 
means for setting the eop and a unique means for 
accessing it. The pseudo-analyser provides a 
"new-eop-scope" structure for setting the eop and 
uses a "sharp" node to refer to it thereafter. 
The new-eop-scope structure has the form 
eop run-time-set do statement 

Using these structures, a parallel form of the if 
statement reduces as shown in the following 
example. 


Example: illustrating the graph for a parallel if 
statement. 
("#rts-n#" denotes a run-time-set identified as 
n.) 


if a[1:10] > 0 then 
   a[#] := a[#] - 1 

becomes 

begin 
   #rts-0# := setfrom (1:[1]10); 
   eop #rts-0# do 
   begin 
      #rts-1# := truevalues (a[#] > 0) ; 
      if anymembers (#rts-1#) then 
         eop #rts-1# do 
         begin 
            a[#] := a[#] - 1 ; 
            #rts-1# := truevalues (a[#] > 0) 
         end 
   end 
end 
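
The four run-time-set operations can be modelled in Python as shown below. This is a sketch under stated assumptions: a run-time-set is represented as an ordered list of indices, the current eop is passed explicitly rather than taken from the enclosing scope, and is_regular is a hypothetical helper for the regularity property. The example data are invented.

```python
def setfrom(first, step, last):
    """Convert the index set expression first:[step]last to a run-time-set."""
    return list(range(first, last + 1, step))


def is_regular(rts):
    """True if adjacent ordered members differ by a constant increment."""
    return len({y - x for x, y in zip(rts, rts[1:])}) <= 1


def truevalues(eop, predicate):
    """Members of the current eop for which the vector expression is true."""
    return [i for i in eop if predicate(i)]


def anymembers(rts):
    """True if the run-time-set is not empty."""
    return len(rts) > 0


def setcomplement(rts, eop):
    """Complement of a run-time-set within the current eop."""
    members = set(rts)
    return [i for i in eop if i not in members]


# The parallel if example: if a[1:10] > 0 then a[#] := a[#] - 1
a = {i: i - 5 for i in range(1, 11)}          # a[1..10] = -4 .. 5
rts0 = setfrom(1, 1, 10)
rts1 = truevalues(rts0, lambda i: a[i] > 0)
if anymembers(rts1):
    for i in rts1:                            # eop #rts-1# do ...
        a[i] = a[i] - 1
else_set = setcomplement(rts1, rts0)
```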


The structures described above are built on the 
first pass of the compiler. Calls to the 
pseudo-analyser are embedded in the corresponding 
analyser recognition procedures. To reduce 
storage requirements, the unit of compilation is 
currently one block. When the graph for a block 
has been constructed, it is passed on for further 
transformation and code synthesis. 


Machine-independent transformations 


The purposes of these transformations can be 
classified into two main areas: 


(1) allocating storage units to variables; 
(2) program optimization. 


Scalar variable allocation 


The time consuming task of fetching variables 
from memory is one which the compiler must 
eliminate whenever possible. Thus an efficient 
register maintenance scheme is essential. In the 
Actus compiler, storage for scalars is not 
allocated statically. Each time a variable 
assumes a new value it is allocated fresh storage 
which becomes free immediately the last access to 
that value has been made. When there is 
redundancy it is possible to allocate more scalar 
variables to registers than there are registers 
on the machine. Resulting housekeeping 
operations (such as dumping values to temporary 
store) are kept to a minimum. 


A variable may have many "lifetimes" in its scope 
and thereby be allocated storage many times. 
Consequently, one major difficulty in the 
implementation of this scheme is to ensure 
consistency of the allocation, particularly 
through the conditional statement sequences of 
the program. To achieve this, all nodes 
corresponding to scalar variables on the graph 
refer the pseudo-analyser to a version number for 
this variable by means of a pointer. Each time a 
new value is assigned to the variable, a new 
reference area containing a new version number is 
created. Until the last reference to this value 
has been made, all nodes for the variable will 
refer to this area. The pair (variable, version 
number) is referred to as a key. Storage is 
allocated to the keys, rather than to variables. 


The consistency problem is now to ensure that 
each variable produces the same key (nodes refer 
to the same version number) whether or not a 
conditional statement sequence is executed. 
Variables which are changed in this statement 
sequence will have been allocated new version 
numbers and are compelled to conform. The 
reference area set up by the last assignment to 
the variable in the statement sequence is 
rewritten with the version number of the variable 
before the condition was evaluated; all nodes 
referring to this value will therefore produce 
the original key and hence be allocated the same 
storage. 
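
One possible reading of the version-number scheme is sketched below in Python. The classes RefArea and VersionTable and their methods are hypothetical names, and the sketch covers only the key mechanism and the rewrite applied after a conditional statement sequence. It does not model storage allocation or the graph itself.

```python
class RefArea:
    """Reference area shared by all nodes that use one value of a variable."""

    def __init__(self, version):
        self.version = version


class VersionTable:
    def __init__(self):
        self.area = {}     # variable -> current reference area
        self.count = {}    # variable -> number of versions issued so far

    def assign(self, var):
        """A new value gets a new reference area with a new version number."""
        version = self.count.get(var, 0)
        self.count[var] = version + 1
        self.area[var] = RefArea(version)
        return self.area[var]

    def key(self, var):
        """The (variable, version number) pair to which storage is allocated."""
        return (var, self.area[var].version)

    def conditional(self, body):
        """Process a conditional sequence, then make changed variables conform."""
        before = {var: area.version for var, area in self.area.items()}
        body(self)
        for var, area in self.area.items():
            if var in before and area.version != before[var]:
                # Rewrite the area set up by the last assignment so that nodes
                # referring to it produce the original key.
                area.version = before[var]


table = VersionTable()
table.assign("x")
key_before = table.key("x")


def body(t):
    t.assign("x")       # first assignment inside the conditional sequence
    t.assign("x")       # last assignment inside the conditional sequence


table.conditional(body)
key_after = table.key("x")
```

After the conditional, key_after equals key_before, so the value of x is allocated the same storage whether or not the conditional sequence was executed.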


Particular care has to be taken when allocating 
version numbers to variables in the alternative 
control paths through parallel conditional 
statements (if..then..else, or case). This is 
because each alternative path is executed in turn 
(though each with its own eop), but each path 
assumes that the variables initially have their 


"start-of-statement" values. For instance, in 
the statement 

if a[#] < 0 then 
begin 
   i := ... 
end 
else 
begin 
   j := ..expression using "i".. 
end 


the second assignment must use the original value 
of i, so the first assignment must not update the 
original location. In general, one solution to 
this problem is to stipulate that alternative 
paths through such a statement must delay 
updating those assigned variables which are 
referenced in later paths until the end of the 
statement. 


Access of structured variables, such as arrays or 
records, is controlled by address calculations 
represented explicitly in the graph. Such 
schemes are commonplace and will not be discussed 
here. 


Program optimization 


Most of the optimizations traditionally 
associated with sequential languages are 
applicable (e.g., removal of loop invariants, 
constant folding, common sub-expression 
elimination, etc.). Since such optimizations can 
be thought of as a manipulation of the program 
text, their implementation is 
machine-independent. Techniques for these 
optimizations are well documented elsewhere. 


Machine-dependent transformations 


The fixed length of the vector registers on the 
Cray-1 necessitates splitting long vectors into 
64-word slices (and usually a remainder), 
referred to as vector slicing. Each 
"new-eop-scope" construct will be governed by 
some loop structure; so each nested structure is 
encountered more than once. This creates 
difficulty when any part of the enclosed 
structure has a different extent of parallelism. 
(i) Creating slicing loops. 

The run-time-set describing the eop will come 
from one of two sources: 


(a) A "setfrom" operation, which defines a 
base, step and length of an eop from an 
index set expression and determines its 
regularity; 

(b) A "truevalues" operation, which defines an 
irregular subset of an enclosing regular or 
irregular eop. 


The "truevalues" operator does not affect the 
choice of elements to be processed on the Cray-1; 
rather, it defines which elements of the result 
vectors are meaningful and, thereby, which 
elements of the parallel variables are to be 
updated after the calculation is completed, using 
a vector merge operation. These elements are 
always a subset of the current eop. On the other 
hand, "setfrom" defines a vector length, which 
may be arbitrarily large. The target machine 
restricts the vector length to a maximum of 64 
elements, so a "setfrom" operation must always 
define a slicing, which will remain unaffected by 
any nested "truevalues" operations. Therefore, 
"new-eop-scope" constructs setting new extents 
which were defined by "setfrom" operations may 
not be nested. These extents must be stacked 
explicitly and recovered explicitly, as the 
language scope rules dictate. Extents set by 
"truevalues" operations are held and nested as 
before; we note the necessity of merging the 
original and result vectors according to the 
extent at the close of every "truevalues" scope. 
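
The interaction between slicing and masked update can be sketched in Python as follows. This is an illustration under assumptions, not generated Cray-1 code: the function names are hypothetical, arrays are dictionaries indexed from 1, and the remainder slice is taken first so that 1:100 is divided into 1:36 and 37:100, as in the shift example later in this section.

```python
VECTOR_LENGTH = 64    # maximum vector length on the target machine


def slices(first, last, vl=VECTOR_LENGTH):
    """Yield inclusive (lo, hi) bounds: the remainder first, then full slices."""
    remainder = (last - first + 1) % vl
    lo = first
    if remainder:
        yield lo, lo + remainder - 1
        lo += remainder
    while lo <= last:
        yield lo, lo + vl - 1
        lo += vl


def masked_decrement(a, first, last):
    """Model of: if a[first:last] > 0 then a[#] := a[#] - 1."""
    for lo, hi in slices(first, last):           # slicing fixed by "setfrom"
        idx = range(lo, hi + 1)
        mask = [a[i] > 0 for i in idx]           # "truevalues" marks valid results
        result = [a[i] - 1 for i in idx]         # computed for the whole slice
        for i, keep, value in zip(idx, mask, result):
            if keep:                             # vector merge under the mask
                a[i] = value


a = {i: i - 50 for i in range(1, 101)}
masked_decrement(a, 1, 100)
```

Note that the mask does not change which elements are processed; it only selects which results are merged back.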


Finally, we emphasize that these transformations 
are machine-specific, whereas the selection of 
the graph operations "setfrom" and "truevalues" 
is not. Although they bear an inexact analogy 
with hardware features specific to the target 
machine (respectively the VL- and VM-registers), 
their properties were dictated by the 
requirements of the language, as described in the 
previous section, not for the convenience of the 
target architecture. 
Problems associated with the implementation of 
this plan fall into two categories. We must 
decide how to deal with structures which cannot 
be successfully sliced, such as certain scalar 
assignments, procedure calls, gotos and nested 
slicing loops; and we must be able to make good 
all implications of execution order expressed in 
the language. 


(ii) Statements with non-conforming extents of 
parallelism. 


When the compiler discovers a structure 
equivalent to 


within 1:100 do 
begin 
   a[#] := b[#] + i ; 
   i := i + 1; (* a scalar statement *) 
   b[#] := j * a[#] 
end 


the scalar statement cannot be included in any 
slicing loops which are created (otherwise i 
would be incremented a number of times). To try 
to reduce the slicing overheads, the compiler 
will attempt to perform a (machine-specific) 
optimization equivalent to 


within 1:100 do 
begin 
   a[#] := b[#] + i ; 
   b[#] := j * a[#] 
end; 

i := i + 1 


where the integrity of the implied loop is 
preserved. Problems would arise if the scalar 
factor in the second assignment were i, not j; 
the loop would have to be broken to produce 


within 1:100 do a[#] := b[#] + i ; 
i := i + 1; 
within 1:100 do b[#] := i * a[#] 

Although this example could be resolved by 
precalculation, in general we cannot restructure 
to maintain a single slicing loop. We note that 
the scalar statement must be moved outside all 
the slicing loops. This is non-trivial if 
several nested conditional ("truevalues") scopes 
are to be interrupted; all must be stored to 
maximize the effectiveness of the slicing when 
resumed and to ensure that result vectors are 
composed correctly. 


(iii) Shift operator 


Two problems are considered here. The first 
occurs in statement sequences such as 


within 1:100 do 
begin 
   a[#] := b[#] + 1 ; 
   c[#] := a[# shift 1] 
end 


where the difficulty derives from the statement 
order. Supposing that the Cray-1 performs the 
body of the construct in two slices using eops of 
1:36 and 37:100, we find on the first pass the 
second vector statement attempts the assignment 

c[36] := a[37] 

thereby assuming that a[37] has been set up 
previously. In fact a[37] is part of the second 
slice of the vector and would normally be set up 
on the second pass. Again, splitting the slicing 
loop is necessary to accommodate the general 
case, as below. 


within 1:100 do a[#] := b[#] + 1 ; 
within 1:100 do c[#] := a[# shift 1] 


Second, problems arise when performing a 
statement such as 


within 1:100 do 
   a[#] := a[# shift -1] 


To avoid overwriting a[36] (with a[35]) before it 
is copied into a[37], it is necessary to 
introduce a temporary storage array (a'), and 
implement the statement as 


within 1:100 do 
   a'[#] := a[# shift -1]; 

within 1:100 do 
   a[#] := a'[#] 
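
The overwriting hazard can be reproduced with a short Python model. It assumes the two slices 1:36 and 37:100 mentioned earlier, and it uses a list with an extra element at index 0 so that a[# shift -1] is defined for the first index. The function names are hypothetical.

```python
SLICES = [(1, 36), (37, 100)]    # slicing of 1:100 assumed in the text


def shift_in_place(values):
    """Naive sliced form of: within 1:100 do a[#] := a[# shift -1]."""
    a = list(values)
    for lo, hi in SLICES:
        loaded = a[lo - 1:hi]        # vector load of a[lo-1 .. hi-1]
        a[lo:hi + 1] = loaded        # vector store into a[lo .. hi]
    return a


def shift_via_temp(values):
    """Form using the temporary array a', as two separate slicing loops."""
    a = list(values)
    temp = list(a)
    for lo, hi in SLICES:            # within 1:100 do a'[#] := a[# shift -1]
        temp[lo:hi + 1] = a[lo - 1:hi]
    for lo, hi in SLICES:            # within 1:100 do a[#] := a'[#]
        a[lo:hi + 1] = temp[lo:hi + 1]
    return a


original = list(range(101))          # original[i] == i for i = 0 .. 100
naive = shift_in_place(original)
correct = shift_via_temp(original)
```

In the naive form a[37] receives the value originally held in a[35], because a[36] was overwritten during the first slice. The form with the temporary array gives a[37] the original a[36].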


(iv) Set operators 

The last problem described here involves the use 
of the "setfrom" operator. Consider the 
statement 


within (i:[2]j + p:[3]q) do 
   a[#] := a[#] + 1 


where the variables i, j, p and q have, for 
example, the values -1, 17, 1 and 25 
respectively. 

The problem arises in that the run-time-set is 
difficult to construct for efficient use. We 
require a run-time-set representation of the set 


{ -1, 1, 3, 4, 5, 7, 9, 10, 11, 13, 15, 16, 
17, 19, 22, 25 } 


which is far from regular. If the index set 
segments were non-intersecting, they might be 
processed separately. Otherwise, as in this 
case, a set must be built in steps (at best) of 
the highest common factor of the two supplied 
steps, and using the maximal bounds. This may be 
an expensive operation, and could produce a 
run-time-set with many more slices than the 
individual components required. 
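
The set quoted above can be reproduced, and the cost of the common-step representation estimated, with a few lines of Python. The helper name segment is hypothetical, and the sketch only counts candidate indices; it does not model how the compiler builds the run-time-set.

```python
from math import gcd


def segment(first, step, last):
    """Indices of the index set segment first:[step]last."""
    return set(range(first, last + 1, step))


i, j, p, q = -1, 17, 1, 25
union = sorted(segment(i, 2, j) | segment(p, 3, q))     # i:[2]j + p:[3]q

# Stepping by the highest common factor of the two steps over the maximal
# bounds covers every member here, but also many indices that are not members.
common_step = gcd(2, 3)
candidates = list(range(min(i, p), max(j, q) + 1, common_step))
```

Here the union has 16 members while the common-step sweep visits 27 candidate indices, which illustrates why such a set may be expensive to process.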


The compiler generates code for a target machine 
via an abstract description of the machine. 


The incentive for building an abstract 
description of the target machine was not only 
the desire for machine independence. It arose 
also from the nature of the Cray-1 architecture, 
and from the fact that its instruction set 
contains a large number of special cases. For 
instance, the CAL instruction 

Ai   Aj + Ak 

stores the integer sum of registers Aj and Ak in 
Ai, unless either j or k is zero. If j is zero, 
the instruction uses the number 0 instead of A0; 
if k is zero, 1 is used instead of A0. These 
special cases are common to much of the 
instruction set, and it would be inefficient for 
the compiler to look for every such case 
explicitly. Instead, the compiler sets up an 
abstract description of each instruction, and 
treats special cases as different instructions. 
An instruction is defined in terms of the 
register classes it accesses and an Aoperation. 


A register class is a set of target machine 
registers which are addressed identically by a 
field in a target machine instruction. Immediate 
constants are conceptually held in registers and 
are classified in exactly the same way. The 
description of a register class is supplemented 
by the types of values it can hold and by the 
number of target registers of that class. 


For the Cray-1, an examination of the instruction 
set produces the register type description 

Aregclass = ( A0, A1to7, A0to7, S0, S1to7, 
              S0to7, B0to63, T0to63, V0to7, 
              VL, VM, Amemory, Aconst0, 
              Aconst1, --- ); 


A0 and S0 are on occasion separated from their 
respective superclasses as their appearance in 
some instructions implies use of a special value 
(as illustrated above); also they are the only 
registers which may be used in determining 
conditional branches. From the information 
already on the graph, it is possible to determine 
the set of register classes which could hold the 
value expressed by any expressional node. 


The Aoperations implemented on the abstract 
machine are derived from the operators present in 
the language. They are described by the type 


Aoperation = ( (* Arithmetic operations *) 
               Aintplus , Aintminus , -- , 
               Arealplus, Arealminus, -- , 
               Aor , Aand , -- , 
               Asetplus , Asetminus , -- , 

               (* Relational operators *) 
               Arealgt , Arealge , -- , 
               Aintgt , Aintge , -- , 

               (* Standard functions *) 
               Asin , -- , Afloat, -- , 
               Asumreal , -- , Afirst, -- , 

               (* Eop manipulation *) 
               Asetfrom, Atruevalues, 
               Asetcomplement, Aanymembers ); 


An instruction description includes information 
on the mnemonic, etc., and also the (two) source 
and (one) destination register classes. 
Instructions are classified by the Aoperation 
which they implement. For instance, the above 
CAL instruction would appear in the abstract 
machine as two instruction descriptions, with the 
destination and source classes: 


     (A0to7     A1to7 + A1to7  ) 
     (A0to7     A1to7 + Aconst1) 


With the instruction set described thus in an 
abstract way, the choosing of instructions for 
each expression-evaluating operation on the graph 
can be performed by the procedure 


procedure chooseinstructions; 


This associates an instruction with each 
Aoperation in an expression, and assigns a unique 
register class to each valued node, taking into 
account future uses of the result. Its operation 
is machine-independent. 
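
The matching of an Aoperation against instruction descriptions can be illustrated with a small sketch. It is written in Python rather than the compiler's Pascal, and it is only an approximation under stated assumptions: the names `InstrDesc`, `SUBCLASSES`, `DESCRIPTIONS`, and `choose_instruction` are invented for the example, the two descriptions are assumed to implement `Aintplus`, and the weighting of future uses of the result, which the text says chooseinstructions performs, is left out.

```python
from dataclasses import dataclass

# Register classes accepted by each class (an assumed subset of the
# Cray-1 classes named in the text; a superclass accepts its members).
SUBCLASSES = {
    "A0to7": {"A0", "A1to7", "A0to7"},
    "A1to7": {"A1to7"},
    "Aconst1": {"Aconst1"},
}


@dataclass(frozen=True)
class InstrDesc:
    aoperation: str   # Aoperation implemented by the instruction
    dest: str         # destination register class
    src1: str         # first source register class
    src2: str         # second source register class


# The two descriptions listed in the text, assumed here to be Aintplus.
DESCRIPTIONS = [
    InstrDesc("Aintplus", "A0to7", "A1to7", "A1to7"),
    InstrDesc("Aintplus", "A0to7", "A1to7", "Aconst1"),
]


def accepts(required: str, actual: str) -> bool:
    """True if an operand of class `actual` fits a field of class `required`."""
    return actual in SUBCLASSES.get(required, {required})


def choose_instruction(aoperation: str, src1: str, src2: str):
    """Return the first description matching the operation and operand classes."""
    for desc in DESCRIPTIONS:
        if (desc.aoperation == aoperation
                and accepts(desc.src1, src1)
                and accepts(desc.src2, src2)):
            return desc
    return None
```

When no description matches, the sketch returns `None`; in the scheme described here that is the case in which an abstract code sequence would be needed instead.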


An Aoperation for which there is no single 
instruction is implemented by an abstract code 
sequence. Included in the abstract description 
of a code sequence is a set of registers, called 
Pregisters, which refer the description to 
registers used by the code sequence (for passing 
values to and from the code sequence, or for 
working space needed by the code sequence). The 
purpose of these Pregisters is to “parameterize” 
the code sequence, so that the registers which 
they use do not need to be pre-allocated, and do 
not need to be fixed for all invocations of the 
code sequence in a program. 


The Pregisters are described by the type 


     Pregister = (Ans, Arg1, Arg2, Dummy, Local1, 
                  Local2, Local3, -- ); 


Each code sequence is held in a record of type 
"codedescription" which describes the register 
classes of each Pregister used by the code 
sequence and timing information, in addition to 
the code sequence itself. An instruction 
description is set up to look like a code 
sequence of length one. 


The code sequences for Aoperations are held by a 
variable 


     Acodesfor : array [ Aoperation ] of 
Acodelist ; 


A list of sequences is held to enable the 
compiler to choose the implementation most 
appropriate to the source and destination 
register classes. To generate machine code for an 
expression, an actual register of the required 
class is found for each Pregister and this 
replaces the Pregister in the abstract sequence, 
leaving target machine code. 
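
A minimal sketch of this substitution step follows, in Python rather than the compiler's Pascal. The function name `instantiate`, the tuple encoding of instructions, and the mnemonics `ADD` and `MOVE` are assumptions made for illustration; only the Pregister names come from the text.

```python
def instantiate(sequence, pregister_classes, free_registers):
    """Replace each Pregister in an abstract code sequence by an actual register.

    sequence          -- list of tuples (mnemonic, operand, ...)
    pregister_classes -- required register class of each Pregister
    free_registers    -- free actual registers, listed per register class
    """
    binding = {}
    for preg, regclass in pregister_classes.items():
        # Take the next free register of the required class.
        binding[preg] = free_registers[regclass].pop(0)
    # Fields that are not Pregisters (e.g. mnemonics) are left unchanged.
    return [tuple(binding.get(field, field) for field in instr)
            for instr in sequence]


# Hypothetical two-instruction abstract sequence.
abstract = [("ADD", "Local1", "Arg1", "Arg2"), ("MOVE", "Ans", "Local1")]
classes = {"Ans": "S", "Arg1": "S", "Arg2": "S", "Local1": "S"}
free = {"S": ["S1", "S2", "S3", "S4"]}
code = instantiate(abstract, classes, free)
```

Because the binding is made at each invocation, the same abstract sequence can be expanded with different actual registers at different points in a program, which is the purpose the text gives for Pregisters.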


Abstract code sequences are also used to 
represent data transfer between register classes 
using the array 


     Path : array [ Aregclass, Aregclass ] of 
            Acodesequence ; 


as some moves are non-trivial. The same means may 
be exploited in representing run time support 
code, such as subrange or array subscript checks. 


The allocation of actual registers to expression 
values treats program variables and temporary 
storage alike; the allocation of registers to 
keys for scalar variables allows a variable to be 
allocated different store from one use to 
another. While this produces efficient register 
use, it leads to considerable overheads in the 
production of meaningful diagnostics. 


The most critical regions of a block are the 
bodies of the innermost loops as these are the 
sections which will be executed most frequently. 
Register allocation is therefore performed from 
the innermost level of loop nesting, working 
outwards. 


CONCLUSIONS 


This paper has described the approach taken in 
the construction of a compiler for an array and 
vector processing language. The compiler has the 
task of compiling machine-independent programs, 
while at the same time exploiting special-purpose 
architectural facilities. 


It has been found that a large proportion of the 
compiler is machine independent, partly because 
of the abstract representations which have been 
used to represent both the source program and the 
target machine. 


Abstract -- Parallel P-code is an intermediate 
compiler language for parallel processors. It was 
originally designed as part of a Parallel Pascal 
compiler for NASA's Massively Parallel Processor 
(MPP). However, it should also be suitable for a 
wide variety of high level languages and parallel 
architectures. Parallel P-code is based on a P- 
code language for serial processors; this paper 
describes the extensions which were necessary for 
the parallel environment. 


Introduction 


Parallel Pascal was designed to be a high-level 
programming language for the MPP [1] and other 
parallel processors. Parallel Pascal is 
characterized by having very few extensions to the 
basic Pascal language in an attempt to provide the 
programmer with good primitive tools rather than 
canned solutions. 


Since its initial specification [3], the design 
of Parallel Pascal has been refined and simplified 
through experience with a Parallel Pascal to 
standard Pascal translator [2] and the development 
of a compiler. The final language specification 
includes array expressions, conditional array 
assignment with a where statement, and the 
manipulation of subarrays with consecutive 
elements. Portability of the language to other 
parallel architectures is enhanced by implementing 
permutation operations with standard functions. A 
complete specification of the language is given in 
references 4 and 5. 


The Parallel Pascal compiler consists of a 
syntax analysis "front end" and a code generation 
"back end." These two phases of the compiler 
communicate through an intermediate language 
called Parallel P-code. The compiler and its 
intermediate language are based upon the P-4 
Pascal compiler [6]. 

The most significant difference between 
standard P-code and Parallel P-code is the way in 
which they treat data types. In standard P-code 
only a few data types are supported: integer, 
real, Boolean, character, set, and pointer. 
Parallel Pascal, however, permits (and in fact 
encourages) the manipulation of arrays as 
aggregates, requiring additional data types. 
Parallel P-code therefore defines a set of base 
types and provides facilities for constructing 
more complex data types. 


The base types are very similar to those in 
standard P-code: integer, real, character, and 
Boolean. With the appropriate definition 
statements, these base types are used to define 
all structured types. 


The definition of subrange types, set types, 
file types, and pointer types is fairly 
straightforward. Objects of these types can be 
directly manipulated because their behavior is 
known at compile time. They are defined by 
statements of the form: 


     .RANGE rng,1,5         ;define subrange 1..5 

     .SET sst sd00          ;define set of 548 

     .FILE ftype,char       ;define file of char 


Array and record types cannot be so easily 
manipulated, because they are not always 
manipulated as entire entities. 


Arrays 


When arrays are manipulated in standard P-code, 
almost all operations are performed on scalar 
elements. (The exception to this rule is a 
provision for moving blocks of data from one place 
to another.) Parallel Pascal, however, requires 
operations to be performed upon arrays as 
aggregates. It was necessary to provide a 
formalism for specifying these parallel operations 
in the intermediate language. 


In order to process array operations, the code 
generator must know at least the size of the array 
and the type of elements. For more sophisticated 
operations (e.g. operations which involve only a 
subset of the array) it must also know the layout 
of the array: the number and range of array 
dimensions. This information can be divided into 
two portions, static and dynamic. 


The static portion represents information that 
is known at compile time. It consists of such 
things as the base type (i.e. the type of the 
array elements), the number of dimensions, and the 
low and high bounds of each dimension. This 
portion can be considered the logical 
specification of the data. 


The dynamic portion of an array type consists 
of the address of the array and the specification 
of which elements are to participate in an 
operation. This portion therefore represents the 
physical specification of the data - where it is 
stored and what portions of it (e.g. which array 
elements) are to be affected. 


The static and dynamic information is 
collectively referred to as an array descriptor. 
These descriptors are essentially generalized dope 
vectors which specify not only the type and size 
of the data but also the subset upon which 
operations are to be performed. An important 
consideration is that while Parallel P-code 
performs transformations upon descriptors, it does 
not specify their exact format. In fact, a 
typical code generator will probably treat the 
descriptors as conceptual entities which have no 
physical counterparts at run time. (A possible 
exception to this would be the treatment of array 
arguments to user-defined functions and 
procedures.) 


The static portion of an array descriptor is 
specified in Parallel P-code via the ".ARRAY" 
pseudo-operator. The base type (i.e. array 
element type), number of dimensions, and range of 
all dimensions are specified. For instance, the 
array type defined by 

     arr = array [1..5,2..6] of integer; 

would be defined in Parallel P-code with the 
statement: 

     .ARRAY arr,integer,2,1,5,2,6 


Parallel Pascal provides the parallel reserved 
word for declaring that an array should be 
allocated in the parallel array memory rather than 
the sequential control unit memory. If an array 
is declared parallel, this fact is reflected in 
Parallel P-code by a negative rank. 
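
To make the static portion concrete, the following Python sketch reads one ".ARRAY" statement into a descriptor record. It assumes the operand order name, base type, rank, then one low/high pair per dimension, which is how the example statement in the source appears to read; the names `StaticArrayDescriptor`, `parse_array`, and `element_count` are invented for the example, and Parallel P-code itself does not prescribe any descriptor format.

```python
from dataclasses import dataclass


@dataclass
class StaticArrayDescriptor:
    name: str
    base_type: str
    rank: int        # number of dimensions (always stored non-negative)
    bounds: list     # [(low, high), ...], one pair per dimension
    parallel: bool   # True when the statement carried a negative rank


def parse_array(statement: str) -> StaticArrayDescriptor:
    """Parse '.ARRAY name,basetype,rank,low1,high1,...' (assumed operand order)."""
    op, operands = statement.split(None, 1)
    if op != ".ARRAY":
        raise ValueError("not an .ARRAY statement")
    fields = [f.strip() for f in operands.split(",")]
    name, base_type, rank = fields[0], fields[1], int(fields[2])
    numbers = [int(f) for f in fields[3:]]
    if len(numbers) != 2 * abs(rank):
        raise ValueError("number of bounds does not match rank")
    bounds = list(zip(numbers[0::2], numbers[1::2]))
    return StaticArrayDescriptor(name, base_type, abs(rank), bounds, rank < 0)


def element_count(desc: StaticArrayDescriptor) -> int:
    """Total number of elements implied by the bounds."""
    count = 1
    for low, high in desc.bounds:
        count *= high - low + 1
    return count


d = parse_array(".ARRAY arr,integer,2,1,5,2,6")
p = parse_array(".ARRAY img,integer,-2,1,4,1,4")
```

The second call uses a hypothetical array `img` to show the negative-rank convention for arrays declared parallel.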


Records 


In order for arrays of records to be 
intelligently processed, it is necessary for the 
intermediate language to define descriptors for 
records as well as arrays. Like array 
descriptors, record descriptors consist of a 
static and a dynamic portion. The static portion 
specifies the record: the fields and their types. 
The dynamic portion specifies the address of the 
record and the field which has been selected for a 
particular operation. 


Because the structure of a record is not as 
regular as the structure of an array, a single 
type definition statement for the static portion 
of a record would be cumbersome. For that reason, 
Parallel P-code defines records according to the 
fields which they contain. The pseudo-operator 
used to define record components is ".RECORD". 
One ".RECORD" is generated for each field. 


Parallel Pascal, like standard Pascal, permits 
variant records. When a record has variants, 
several components will share the same memory 
allocation. (Only one is in use at any given 
time.) Parallel P-code permits the specification 
of an offset with each field declaration. A 
record definition in Parallel P-code consists of a 
sequence of ".RECORD" statements. Normally, each 
successive field in the same record is assigned a 
sequential location in memory. However, this 
behavior can be overridden so that a field is 
aligned at the same offset as a previous field. 


The general syntax of the ".RECORD" pseudo-op 
is 


     .RECORD rname,fname,offset,ftype 


where "rname" is the name of the record being 
defined, "fname" is the name of the field being 
defined, "ftype" is the type of the field, and 
"offset" is either "nil" or the name of a 
previously-defined field. If "offset" is the 
literal string "nil", the next sequential memory 
location is assigned; otherwise, the new field 
"fname" is aligned with the existing field 
"offset". As an example, the record defined by: 


     rec = record 
              x: integer; 
              y: real; 
              case Boolean of 
                 false: (zf: integer); 
                 true: (zt: real) 
           end; 


would be translated to 


     .RECORD rec,x,nil,integer 
     .RECORD rec,y,nil,real 
     .RECORD rec,zf,nil,integer 
     .RECORD rec,zt,zf,real 
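
One way a code generator might turn such a sequence of ".RECORD" statements into a memory layout is sketched below in Python. Parallel P-code leaves the layout to the implementation, so this is only one possible reading: the function name `layout`, the tuple encoding of the statements, and the field sizes passed in are all assumptions.

```python
def layout(record_statements, sizes):
    """Assign a start offset to each field.

    record_statements -- tuples (rname, fname, offset, ftype), in order
    sizes             -- storage units occupied by each field type (assumed)
    """
    offsets = {}
    next_free = 0
    for rname, fname, offset, ftype in record_statements:
        if offset == "nil":
            start = next_free            # next sequential location
        else:
            start = offsets[offset]      # overlay on the named earlier field
        offsets[fname] = start
        # Sequential allocation resumes after the largest extent seen so far.
        next_free = max(next_free, start + sizes[ftype])
    return offsets


rec = [
    ("rec", "x", "nil", "integer"),
    ("rec", "y", "nil", "real"),
    ("rec", "zf", "nil", "integer"),
    ("rec", "zt", "zf", "real"),
]
unit_offsets = layout(rec, {"integer": 1, "real": 1})
wide_offsets = layout(rec, {"integer": 1, "real": 2})
```

In both layouts the variant fields `zf` and `zt` receive the same offset, which is the effect of naming `zf` in the offset position of the last statement.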


Variable Allocation 


In order to examine the specification of the 
dynamic portion of array and record descriptors it 
is necessary to first consider the way in which 
Parallel P-code allocates and references local 
variables. 


Corresponding with each called function or 
procedure is an area on the runtime stack called 
the stack frame (or activation record). In 
addition to the arguments to the function or 
procedure, the local variables, and space for 
temporary results, the stack frame includes some 
linkage information. In standard P-code this 
includes the return address, space for a returned 
function result (this field is unused for 
procedures), and two locations for the static and 
dynamic links. The static and dynamic links point 
to the appropriate previous stack frames. The 
hypothetical machine which implements P-code 
contains a non-user-accessible register called the 
"frame pointer" which holds the address of the 
current stack frame. 


Parallel P-code, like standard P-code, 
addresses operands according to the lexical level 
at which they are defined. However, in order to 
deal with objects on an abstract basis (and 
thereby avoid specifying memory allocation in the 
intermediate language), physical address offsets 
are not used. Rather, variables are referred to 
by a logical index, so that the lexical address is 


(level,index) 


Hence, Parallel P-code does not define the exact 
format of a stack frame; this is left to the 
implementation. 


An index is assigned to all local variables, 
procedure or function arguments, and the function 
return value (if the routine is a function). All 
of these share the same set of indices. The index 
zero is reserved for the result of a function. 
Arguments are assigned indices starting at 1, and 
local variables are assigned indices beginning 
immediately after the last argument. 
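
The numbering rule can be written out directly. The Python sketch below assumes a helper named `assign_indices` and uses the placeholder key `<result>` for the function return value; both are invented for the example.

```python
def assign_indices(arguments, local_variables, is_function):
    """Number the result, arguments and locals of one routine."""
    indices = {}
    if is_function:
        indices["<result>"] = 0          # index zero: function result
    for i, name in enumerate(arguments, start=1):
        indices[name] = i                # arguments start at 1
    first_local = len(arguments) + 1     # locals follow the last argument
    for i, name in enumerate(local_variables, start=first_local):
        indices[name] = i
    return indices


function_indices = assign_indices(["a", "b"], ["t"], is_function=True)
procedure_indices = assign_indices([], ["u", "v"], is_function=False)
```

Combined with the lexical level of the routine, each index gives the (level,index) address used to refer to the variable.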


Although function (or procedure) arguments and 
local variables share the same set of indices, 
they require somewhat different treatment when 
they are defined. Thus, two statements are used 
for arguments and local variables. The definition 
statements are: 


     .ARG   index,type,rv 
     .LOCAL index,type,overlay 


where "index" is the index number, "type" is the 
type of the argument or variable, and "rv" (for 
".ARG") is zero if the argument was passed by 
value or one if it was passed by reference. The 
"overlay" field (for ".LOCAL") is similar to the 
"offset" field for the ".RECORD" pseudo-operator. 
It is normally zero, 
indicating that the local variable should be 
allocated the next available memory location (or 
locations). If it is non-zero, it specifies a 
previously-defined local variable (at the same 
lexical level); the new variable is to be overlaid 
on the memory allocated for the specified old 
variable. 


Parallel P-code provides two statements, 
".ENTRY" and ".EXIT", to define the lexical level 
of the procedure they enclose. 


Runtime Operation 


In Parallel P-code, as in standard P-code, all 
operations are performed by means of a run-time 
stack. Data is loaded onto the top of the stack, 
manipulated on the stack, and stored from the top 
of the stack. In standard P-code, data is 
manipulated in one of two ways. The first way is 
to load the data onto the stack and manipulate it 
directly. This is the most common method (in 
standard P-code) and it works well because Pascal 
usually deals only with one item at a time. An 
alternate way is to perform a data transfer of a 
compile-time specified number of elements between 
two addresses which are computed at runtime. In 
this second case (used in assignment statements 
where both sides are identical arrays or records) 
the addresses, not the data, reside on the stack. 
They could be called very simple descriptors 
because they describe where the referenced data is 
(or is to go). 


Parallel P-code also makes use of these two 
mechanisms. When an operation is performed on 
scalar data, the data itself is loaded onto the 
runtime stack, manipulated, and stored from the 
stack. When an operation involves an array or 
record, or some combination thereof, the second 
method is called for. However, because Parallel 
Pascal provides more flexibility in aggregate 
operations, an address alone is not sufficient; 
rather, information must be provided about the 
shape and type of the data. The type information 
is supplied by the static descriptor (i.e. by an 
".ARRAY" or ".RECORD" definition). The runtime- 
dependent shape information is provided by the 
dynamic descriptors on the runtime stack. 


The runtime nature of an array is determined by 
two dynamic attributes: the address of the array 
and the index ranges of its dimensions. The 
dynamic (physical) portion of the array descriptor 
which resides upon the runtime stack specifies 
these attributes. This information is constructed 
by loading a "blank" descriptor (one which 
specifies the array address but does not specify 
index ranges) and then "filling in" the index 
ranges using one of three operators corresponding 
to Parallel Pascal's array indexing modes: "IX0" 
(select entire index range), "IX1" (index by a 
scalar), or "IX2" (index by a subrange). Each 
successive index instruction is applied to the 
next unspecified array index range. The 
intermediate language does not specify the format 
of the dynamic array descriptor; this is solely 
the domain of the code generator. 
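
Since the format of the dynamic descriptor is left to the code generator, the following Python sketch shows just one representation a code generator could choose. The class name `DynamicArrayDescriptor`, its fields, and the choice to record a scalar index as a one-element range are assumptions; only the three operator names come from the text.

```python
class DynamicArrayDescriptor:
    """A 'blank' descriptor that is filled in one dimension at a time."""

    def __init__(self, address, bounds):
        self.address = address   # where the array is stored
        self.bounds = bounds     # static (low, high) pair per dimension
        self.selected = []       # index ranges specified so far

    def _next_bounds(self):
        # Each index operator applies to the next unspecified dimension.
        if len(self.selected) >= len(self.bounds):
            raise IndexError("all index ranges are already specified")
        return self.bounds[len(self.selected)]

    def ix0(self):
        """IX0: select the entire index range."""
        low, high = self._next_bounds()
        self.selected.append((low, high))
        return self

    def ix1(self, i):
        """IX1: index by a scalar."""
        low, high = self._next_bounds()
        if not low <= i <= high:
            raise IndexError("scalar index out of range")
        self.selected.append((i, i))
        return self

    def ix2(self, lo, hi):
        """IX2: index by a subrange."""
        low, high = self._next_bounds()
        if not low <= lo <= hi <= high:
            raise IndexError("subrange out of range")
        self.selected.append((lo, hi))
        return self


# All rows, columns 3..4 of an array with bounds [1..5, 2..6]
# (the address 1000 is an arbitrary placeholder).
desc = DynamicArrayDescriptor(1000, [(1, 5), (2, 6)]).ix0().ix2(3, 4)
row = DynamicArrayDescriptor(1000, [(1, 5), (2, 6)]).ix1(2).ix0()
```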


In contrast with arrays, only one component of 
a record may be specified at a time. However, 
unlike arrays, the fields in a record are non- 
homogeneous. The manner in which the target 
machine stores the fields of the records will 
affect how a record field is specified; the 
compiler cannot simply calculate a constant offset 
(as is done in standard P-code). All record field 
selection in Parallel P-code is performed with 
symbolic names. The names correspond to the field 
names defined in ".RECORD" statements. 


The exact format of a record descriptor is not 
known to the compiler "front end." Instead, the 
record descriptor is constructed with the aid of 
the "select" ("SEL") instruction. A descriptor 
that specifies the entire record is loaded onto 
the stack; this is similar to the "blank" 
descriptor described above for arrays but may be 
used without further modification to access the 
entire record. The "SEL" operator is used to 
select a field from the record. This replaces the 
record descriptor on top of the stack with a 
modified descriptor that indicates the address of 
the record and the selected field. If that field 
is itself a record, another "SEL" is then used to 
select a field within that sub-record. 


Descriptors for more complex structures (e.g. 
arrays of records, arrays within records) are 
constructed by repeated application of the 
techniques described above. 


When an operation is performed on scalars, the 
address where the result is to be stored is loaded 
onto the stack, the scalar expression is 
calculated, and a "store indirect" is performed to 
store the result of the expression (on top of the 
runtime stack) at the specified address (the 
second item on the runtime stack). 


When an operation is performed on a structured 
type, the result must be stored in a temporary 
area and a descriptor for that temporary placed 
upon the runtime stack. The automatic allocation 
of the temporary storage to which the descriptors 
refer is the responsibility of the implementation. 


Parallel Control 


Parallel Pascal provides the standard Pascal 
control statements if, case, for, while, and 
repeat-until. The implementation of these 
control-flow constructs in Parallel P-code is 
identical to the implementation in standard 
P-code. Parallel Pascal also provides a construct 
to allow masked assignment of arrays, the where 
statement, which cannot be (efficiently) 
implemented with the scalar-oriented control 
mechanisms of standard P-code. 


Since the where statement controls array 
assignments, the implementation in Parallel P-code 
will only affect stores. In general, SIMD-class 
parallel processors associate with each processing 
element a flag known as the "mask bit" or 
"activity bit." This bit controls whether the 
processor is enabled or disabled. The collection 
of mask bits for all processors can be considered 
to be a Boolean "mask array." The controlling 
expression of a where statement in Parallel Pascal 
is a Boolean array; hence, it is natural to 
implement the where statement by using this array 
as a mask array. 


The current conditional status of a set of N 
nested conditionals can be determined by using a 
stack of length N bits. If the current 
conditional state is A and a where statement is 
encountered which evaluates to B, the new 
conditional state is AB (the Boolean product of A 
and B). At some later point, if an otherwise is 
encountered, the desired conditional state is 
A(~B). This can be computed by 


     A(~B) = AB ⊕ A 


where ⊕ denotes the Boolean exclusive-or. 


The stack implementation is defined as follows. 
Initially the stack is empty and all processors 
are enabled. When a where conditional is 
encountered, a Boolean "and" is performed with the 
current top of the stack (if the stack is non- 
empty) and the result is pushed onto the stack. 
When an otherwise is encountered, a Boolean 
"exclusive-or" is computed between the top two 
elements of the stack and the result replaces the 
top of the stack. (If the stack contains only one 
item, it is complemented.) When the end of the 
conditional is encountered, the stack is popped. 
These three operations (pushing a new mask, 
complementing the current mask, and popping the 
mask) are provided in Parallel P-code by the 
"WHR", "OTW", and "ENW" operators. 

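The mask stack described above can be simulated in a few lines. This Python sketch models each mask as a list of Booleans, one per processing element; the class name `MaskStack` and its method names are assumptions that simply mirror the "WHR", "OTW", and "ENW" operators, and nothing here reflects how the MPP hardware holds its masks.

```python
class MaskStack:
    """Stack of Boolean mask arrays for nested where statements."""

    def __init__(self, n_processors):
        self.n = n_processors
        self.stack = []

    def current(self):
        """Mask in effect now; all processors are enabled on an empty stack."""
        return self.stack[-1] if self.stack else [True] * self.n

    def whr(self, condition):
        """WHR: AND the condition with the current top (if any) and push it."""
        if self.stack:
            condition = [a and b for a, b in zip(self.stack[-1], condition)]
        self.stack.append(list(condition))

    def otw(self):
        """OTW: replace the top by (top XOR next-to-top), or complement it."""
        top = self.stack[-1]
        if len(self.stack) == 1:
            self.stack[-1] = [not b for b in top]
        else:
            below = self.stack[-2]
            self.stack[-1] = [a != b for a, b in zip(top, below)]

    def enw(self):
        """ENW: pop the mask at the end of the conditional."""
        self.stack.pop()


m = MaskStack(4)
A = [True, True, False, False]
B = [True, False, True, False]
m.whr(A)
m.whr(B)
inner_where = m.current()        # AB
m.otw()
inner_otherwise = m.current()    # AB xor A, which equals A(~B)
m.enw()
outer_where = m.current()        # back to A
```

For the sample masks the otherwise branch enables only the second processor, which is exactly where A is true and B is false.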

Conclusion 


Parallel P-code has the following extensions 
relative to standard P-code for parallel languages 
and processors. First, it provides a mechanism by 
which non-primitive types may be specified. 
Second, it provides an abstract addressing scheme 
for allocating and referencing automatically- 
allocated variables. Third, it provides 
mechanisms for operating upon arrays, array 
subsets, and individual array elements. Fourth, 
it provides a symbolic mechanism for defining and 
referencing the fields of a record structure. 
Finally, it facilitates conditional assignment by 
providing mechanisms for establishing, altering, 
and removing a Boolean mask array. 


Abstract 


This paper introduces the DC1 flow schema, a 
pragmatic asynchronous parallel computation 
model. The disadvantages of the currently exist- 
ing flow schemas are discussed. The representa- 
tion and the evaluation of computation in the DC1 
flow schema are described. The operations of the 
primitive operators are described. Two applica- 
tions of the DC1 flow schema are shown. These 
are the representation and evaluation of the non- 
deterministic programs, and of the programs pro- 
cessing the infinite data structures. The DC1 
flow schema is compared with the data flow schema 
and the demand-driven computation in those appli- 
cations. 


Keywords 


flow schema, computation model, 
computation representation, 
computation evaluation, 

DC1 flow schema, data flow schema, 
demand-driven computation 


1. Introduction 


The von Neumann architecture has been the 
basis of most of the computers built to date, 
even though this architecture imposes several 
unnecessary restrictions [Back78] [Myer78]. For 
example, the thinking and programming are done in 
the “primitive word-at-a-time” style. Further, 
the evaluation of the programs has the “von 
Neumann bottleneck” between the CPU (central 
processing unit) and the store [Back78]. 


Based on this recognition, there have been 
some new approaches to computation. The new 
approaches usually use the side-effect free 
(functional) languages as programming languages. 
They also use asynchronous parallel computation 
models to represent and evaluate programs. One 
of the objectives of these computation models is 
to exploit the implicit concurrencies in 
programs. We concentrate on the asynchronous 
parallel computation models in this paper. 


A computation model may be called a “flow 
schema” ¹. The flow schema is defined as follows: 


The flow schema is an operational model 
of computation. It consists of a 
representation of computation and an 
interpretation which operates on the 
representation. 


The interpretation in this definition may be 
called the "evaluation". 


The currently existing asynchronous parallel 
computation models include the data flow schema 
[DEMi75] [AgAr82] and the demand-driven 
computation [KLP 79] ². The data flow schema can 
be classified as the static data flow schema or 
the dynamic data flow schema on the basis of its 
enabling rule. In this paper the terms demand- 
driven computation, delayed evaluation, and lazy 
evaluation [Hend80] [FrWi76] are used 
interchangeably. 


There are some drawbacks in the existing flow 
schemas above. The demand-driven computation is 
usually inefficient compared to the data flow 
schema, because the 'demand' signal must 
propagate [DaKe82]. As a result, the propagation 
delay time is included in the execution time. 
Also, the amount of communication among the nodes 
is doubled, which makes the evaluation less 
efficient. The static data flow schema usually 
exploits less concurrency than the dynamic data 
flow schema, because the enabling rule of the 
static data flow schema is stronger than that of 
the dynamic data flow schema. While the dynamic 
data flow schema is the most efficient, it may 
not be safe when it evaluates programs that 
process infinite structures [ArPi82]. 


In this paper we propose a new flow schema, 
called the DC1 flow schema ³. The representation 
and evaluation of computation in the DC1 flow 
schema are described. Some applications of the 
DC1 flow schema are given. They are the 
representation and evaluation of nondeterministic 
programs, and of programs which process infinite 
data structures. Through these examples it is 
shown that the DC1 flow schema is as efficient as 
the dynamic data flow schema ⁴. Also, it is shown 
that the DC1 flow schema is safe in evaluating 
programs that process infinite structures. 


¹ This definition is a modification of that in 
[Weng79]. In that paper, this term was used to 
define the data flow schema. 

² Since the demand-driven computation is a 
standard terminology, we use it rather than the 
demand flow schema. 

³ The DC1 is an acronym of Data Control flow 
schema version 1. 

⁴ For some cases such as the nondeterministic 
program evaluation, the DC1 flow schema is more 
efficient than the dynamic data flow schema as 
shown later in this paper. 


2. The DC1 flow schema 


The computation is represented as a directed 
graph in the DC1 flow schema. This graph is 
called the "DC1 graph" in the sequel. Evaluation 
of the computation is done by following the two 
rules: enabling rule and execution rule. These 
are described later in this section. 


2.1. The DC1 Graph : Structure and Characteristics 


The DC1 graph is a directed graph consisting 
of a set of nodes and a set of directed arcs. 
The nodes in the graph are instances of the node 
schemas of the DC1 flow schema. These node 
schemas are shown in figure 2-1. The node schemas 
shown in figure 2-1(a),(c),(e),(f),(g) and (h) 
are similar to the actors in the data flow schema 
in [Weng79]. One difference is that every node 
in the DC1 flow schema may have input/output 
control signal arc(s) besides the data arcs. Note 
that the LINK node and the SINK node can handle 
both control signals and data in the DC1 flow 
schema. (See figure 2-1(b) and (d)). The RS 
(Random Selector) node schema, shown in figure 
2-1(i), is introduced in the DC1 flow schema. It 
is a nondeterministic node. The operation of 
each type of node is described in section 2.2.3. 


Figure 2-1 The Node Schemas in the DC1 Flow Schema 


There are two types of information 
representable in the DC1 graph. The first type 
is the data type. The data type may be integer, 
real, boolean, string, error, structure, etc. ⁵ 
A value of one of the first five types is 
called a simple value. A structure may be 
represented as a stream value internally. All 
the variables in the DC1 flow graph are either 
simple values or stream values. The second 
type is the control signal type. There are two 
control signals of the control signal type in the 
DC1 graph : DELAY-ENABLE and FORCE-ENABLE. The 
DELAY-ENABLE control signal is used to delay the 
enabling of a node whose inputs are available. 
The DELAY-ENABLE control signal may have one of 
the two values : EN and NEN. The FORCE-ENABLE 
control signal is used to force a node to be 
enabled. These two control signals provide the 
DC1 graph with a means for sequencing in 
addition to the data dependent sequencing. 


In the DC1 flow graph we restrict the usage 
of the two control signals as follows: 
(i) a node cannot have both types of control 
signal inputs at the same time, 
(ii) a node cannot have both a FORCE-ENABLE 
control signal input arc and data input arcs at 
the same time. 


There are two types of arcs in the DC1 flow 
schema. The first type is the data arc. It 
transmits data tokens which contain values. It 
is represented as an arrow with a solid line. 
The second type is the control signal arc. It 
transmits control signal tokens which contain 
control signals. It is represented as an arrow 
with a dashed line. 


2.2. Interpretation in the DC1 Flow Schema 


The interpretation of the DC1 graph consists 
of the enabling rule and the execution rule. The 
enabling rule governs when a node becomes 
enabled. The execution rule governs how an 
enabled node is executed. The interpretation in 
the DC1 flow schema is called the "DC1 
evaluation". 


2.2.1. The Enabling Rule of the DC1 Evaluation 


A node in the DC1 flow schema is enabled based 
on the following three rules. 


(E1) If a node has no input control signal arc, 
     it is enabled when all its necessary input 
     data are available. 

(E2) If a node has a DELAY-ENABLE input control 
     signal arc, it is enabled when all its 
     necessary input data are available and the 
     control signal is available (regardless of 
     the value of the DELAY-ENABLE control 
     signal). 

(E3) If a node has a FORCE-ENABLE input control 
     signal arc, it is enabled if the control 
     signal is available. 


⁵ As in the Id [AGP 78], it may include 
procedure definition, manager definition, and 
manager object in the data type. 


Note that the enabling rule for nodes without any 
input control signal arc is purely data dependent 
(by rule E1). The DELAY-ENABLE input control 
signal arc for a node may be used to delay the 
enabling of the node when its input data are 
available (by rule E2). The FORCE-ENABLE control 
signal forces a node to be enabled (by rule E3). 
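
Rules E1 to E3 amount to a small decision function. The Python sketch below is a direct transcription under assumed names (`is_enabled`, `control_kind`, `data_ready`, `control_ready`); it says nothing about how an actual DC1 evaluator tracks tokens.

```python
def is_enabled(control_kind, data_ready, control_ready=False):
    """Decide whether a node is enabled.

    control_kind  -- None, "DELAY-ENABLE" or "FORCE-ENABLE"
    data_ready    -- True when all necessary input data are available
    control_ready -- True when the control signal token is available
    """
    if control_kind is None:
        return data_ready                      # rule E1
    if control_kind == "DELAY-ENABLE":
        # Rule E2: the value (EN or NEN) plays no part in enabling.
        return data_ready and control_ready
    if control_kind == "FORCE-ENABLE":
        return control_ready                   # rule E3
    raise ValueError("unknown control signal kind")
```

The value carried by a DELAY-ENABLE signal matters only later, when the enabled node is executed.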


2.2.2. The Execution Rule of the DC1 Evaluation 


A node in the DC1 flow schema is executed on 
the basis of the following execution rules in the 
DC1 evaluation: 


(X1) Enabled nodes are executed concurrently or 
     in random order. 

(X2) Tokens on the input arcs are consumed before 
     execution. After execution, tokens may be 
     generated on the output arcs of the node. 

(X3) The execution of the operation of the node 
     is performed as follows: 

     (X3.a) A node without any input control 
            signal arc : performs the operation 
            of the node on the input data, 

     (X3.b) A node with the FORCE-ENABLE input 
            control signal arc : performs the 
            operation of the node, 

     (X3.c) A node with the DELAY-ENABLE input 
            control signal arc : if the control 
            signal value is EN, it performs the 
            operation of the node on the input 
            data; otherwise (i.e., control signal 
            value is NEN) it does nothing. 


A node with the FORCE-ENABLE input control signal
arc performs its operation, which does not require
any input data (see rule X3.b). Examples of this
type of node include a constant generation func-
tion, which generates a constant, and an RS node,
which is described in the next section.
Following the execution rule (X3.c), a node with
the DELAY-ENABLE input control signal consumes
its token(s) but does not generate one if the
control signal value is NEN.
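The enabling rules (E1)-(E3) and the execution rules (X2)-(X3) can be sketched as two small functions. This is only an illustrative sketch: the names `Node`, `is_enabled` and `execute`, and the use of `None` for "no token", are assumptions made for the example and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional

EN, NEN = "EN", "NEN"  # the two control signal values


@dataclass
class Node:
    op: Callable[..., object]      # operation of the node
    control: Optional[str] = None  # None, "DELAY-ENABLE" or "FORCE-ENABLE"


def is_enabled(node, data_ready, control_token):
    """Rules E1-E3; control_token is None when no control token has arrived."""
    if node.control is None:            # E1: purely data dependent
        return data_ready
    if node.control == "DELAY-ENABLE":  # E2: data and control, any value
        return data_ready and control_token is not None
    return control_token is not None    # E3: the control token alone suffices


def execute(node, inputs, control_token):
    """Rules X2-X3: the inputs are consumed; returns the output token or None."""
    if node.control is None:            # X3.a
        return node.op(*inputs)
    if node.control == "FORCE-ENABLE":  # X3.b: no input data required
        return node.op()
    if control_token == EN:             # X3.c with EN
        return node.op(*inputs)
    return None                         # X3.c with NEN: absorb, emit nothing
```

Note that a DELAY-ENABLE node with a NEN signal is still enabled and executed; it simply consumes its tokens without producing output.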


Operation of Nodes in the DC1 Flow Schema


The nodes in the DC1 flow schema can be 
classified as the deterministic and the nondeter- 
ministic nodes. There are two types of the non- 
deterministic nodes: MERGE and RS(Random Selec- 
tion). The other nodes are the deterministic 
nodes. 


The operations of the deterministic nodes
are similar to those of the corresponding actors
in the data flow schema in [Weng79], if we
neglect the control signal arc(s) of the nodes.
The operation of a deterministic node is
described in this paper only where it is needed.
The operations of the two nondeterministic nodes
are described below.


The nondeterministic MERGE nodes in the DC1
flow schema may operate on simple values as
well as on stream values (footnote 6). The
operation of the MERGE node is to reproduce an
input data token (on an input data arc) onto the
output data arc as soon as one is available. This
operation implies that the output of a MERGE node
depends not only on the input data values but also
on the time when the input data tokens arrive at
the node. Thus the operation of the
nondeterministic MERGE node is not a function.


There are two kinds of RS node, distinguished by
a subscript on the node name: one has the
FORCE-ENABLE control signal arc as its input arc,
and the other has the DELAY-ENABLE control signal
and/or the data arc(s) as its input arc(s). For
the purpose of explanation in this paper, only the
operation of the first kind is described below.


<Operation of the RS node>


(a) selects one output control signal arc ran-
    domly among the output control signal arcs,

(b) sends an EN control signal on the selected
    output arc; sends a NEN control signal on
    each unselected output arc.


Only one output arc of the RS node transmits
the EN control signal. All the other output arcs
transmit NEN control signals.
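Steps (a) and (b) of the RS node can be sketched as follows. The function name `rs_node` and the list-of-signals representation are assumptions made for illustration; the source does not specify how the random selection is realized.

```python
import random

EN, NEN = "EN", "NEN"


def rs_node(n_outputs, rng=random):
    """Return one control signal per output arc: exactly one EN, the rest NEN."""
    selected = rng.randrange(n_outputs)   # step (a): random selection of an arc
    return [EN if i == selected else NEN  # step (b): EN on it, NEN elsewhere
            for i in range(n_outputs)]
```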


3. The DC1 Flow Schema for the Nondeterministic 
Programs 


A program is said to be nondeterministic if 
a given input state can lead to more than one 
possible terminal state [Gold82]. It may happen 
if the program is permitted to make a random 
choice of its next action from a number of possi-
bilities. In this section we describe the
DC1 graph and the DC1 evaluation for
nondeterministic programs (footnote 7).


The or operator in [Hend80] is used to 
represent the random choice operation in this 
paper. Consider a program as follows: 


6 This is a difference between the MERGE node 
in the DC1 flow schema and that in the data flow 
schema in [ArBr82]. In [ArBr82] the MERGE actor 
operates only on the stream values. But it is 
necessary for the MERGE node to operate on the 
simple values in order to collect values generat- 
ed mutually exclusively. 


7 In fact, these programs may be called the 
indeterminate programs 


The value of the function f may be any one of the
values of the functions g_1, ..., g_n.


A DC1 graph for the program <1> is shown in
figure 3-1. Each function g_i, 1 <= i <= n, is
represented as a node in the figure. (This
representation is sufficient for the description
below.) There are two initial tokens: one is a
data token which contains the value of x, and
the other is a control signal token of the FORCE-
ENABLE control signal to the RS node.


This DC1 graph is evaluated as follows.
Initially, only the RS node is enabled (by
the FORCE-ENABLE control signal). When executed,
this RS node generates one EN control signal and
(n-1) NEN control signals. The function, say g_i,
whose input DELAY-ENABLE control signal value is
EN performs its operation next. The other functions
absorb both the data token and the control signal
token. After the function g_i completes its
operation, its result is transmitted as a data
token from the function g_i to the MERGE node.
Then the MERGE node produces a token whose value
is the same as that of its input data token; this
is the result of the graph, and the evaluation is
completed.
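The evaluation just described can be traced with a short sketch. It assumes that the program <1> applies each of n functions to the same input x and returns any one of the results; the function name `evaluate_program_1` and the returned list of executed indices are illustrative assumptions.

```python
import random


def evaluate_program_1(x, functions, rng=random):
    """DC1-style evaluation: RS selects, one g_i operates, MERGE forwards."""
    selected = rng.randrange(len(functions))  # RS node: one EN, (n-1) NENs
    executed = []                             # indices of functions that ran
    merge_inputs = []
    for i, g in enumerate(functions):
        if i == selected:                     # DELAY-ENABLE value is EN
            merge_inputs.append(g(x))
            executed.append(i)
        # otherwise (NEN): data and control tokens are absorbed, no output
    result = merge_inputs[0]                  # MERGE reproduces the only token
    return result, executed
```

Only the selected function is ever called, so a function that would not terminate does no harm unless it is the one selected.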


As a more practical example, we show a DC1
graph and DC1 evaluation for the following non-
deterministic program (footnote 8).


choice(n) = if n=1 then 1
            else ( choice(n-1) or n )


Figure 3-1 A DC1 graph for the program <1>
(legend: d-e: DELAY-ENABLE; f-e: FORCE-ENABLE)


8 This program is from [Hend80].

Given a positive integer number, the program
renders an arbitrary integer ranging from 1 to 
the given input value. 
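A direct transcription of the 'choice' program may help to see why its result lies between 1 and n. The `or` operator does not fix a probability for either alternative; the fair coin used below is an assumption made for the sketch.

```python
import random


def choice(n, rng=random):
    """choice(n) = if n = 1 then 1 else (choice(n-1) or n)."""
    if n == 1:
        return 1
    # 'or' picks one alternative at random; only the chosen one is evaluated
    if rng.random() < 0.5:
        return choice(n - 1, rng)
    return n
```

Each recursive call can only lower the candidate value by one or stop, so the result is always an integer in the range 1..n.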


A DC1 graph representation of the above pro- 
gram is shown in figure 3-2. The node EQFFE in 
the graph is a predicate operator which generates 
a data token and a control signal token. The 
control signal generated at the EQFFE operator is 
used as an input FORCE-ENABLE control signal to 
the RS node. A small square box with a con-
stant value in it in the DC1 graph represents a
constant generating function. A subgraph in the
inner box represents the nondeterministic selec- 
tion of a value. As shown in the figure, there 
are two initial tokens. One is a data token which 
contains the value of n. Another is a data token 
whose value is the function definition of the 
“choice” function, which is represented as 
{choice}. 


The operation of the APPLY node is to invoke
the recursion. The operation of the LINK node
is to reproduce its input data token on the out-
put data arc(s). The operation of the EQFFE
operator is described as follows:


compare the two data inputs; 


if equal 
then generate “true” value 
on the output data arc; 
else generate “false” value 
on the output data arc; 
generate a token 
on the output control signal arc; 
endif 
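The EQFFE operator described above can be sketched as a function returning a pair of tokens. Representing an absent control token as `None` and a present one as the string "token" is an assumption of this sketch; for a FORCE-ENABLE arc only the presence of the token matters.

```python
def eqffe(a, b):
    """Return (data_token, control_token); control_token is None when absent."""
    if a == b:
        return True, None    # "true" on the data arc, no control token
    return False, "token"    # "false" plus a token on the control signal arc
```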
Figure 3-2 A DC1 graph for the 'choice' program
(initial tokens: n and {choice}; legend: d-e: DELAY-ENABLE; f-e: FORCE-ENABLE)


The evaluation of the DC1 graph in figure 3-2
is done as follows. The first node enabled and
executed is the EQFFE operator.

(1) Case 1: If the two inputs to the EQFFE opera-
    tor are the same (i.e., n=1), then a "true"
    value is generated as an output data token of
    the operator. Then the SWITCH node SW1
    passes its input value onto its output arc,
    and the output value of the DC1 graph becomes
    1.

(2) Case 2: If the two inputs to the EQFFE opera-
    tor are not the same, then a "false" value is
    generated as an output data value of the
    operator. At the same time, the EQFFE opera-
    tor generates a control signal token, which
    forces the RS node to be enabled. Since
    the SWITCH nodes SW2 and SW3 pass their input
    values, the value chosen by the RS node (and
    passed by the MERGE node) is either n or
    choice(n-1) (footnote 9). Therefore, the
    output of the graph is either n or
    choice(n-1).

From the above evaluation it is clear that the
graph performs the operation specified by the
program (footnote 10).


In this section, we described the applica- 
tion of the DC1 flow schema for the nondeter- 
ministic programs. We considered two examples of 
the nondeterministic program. 


4. The DC1 Flow Schema for the Programs Processing
the Infinite Data Structures


It has been observed that the incorporation 
of infinite data structures into programming 
languages provides the programmer with a powerful 
tool for writing structured and elegant programs 
[ArPi82]. One example of the programs using the 
infinite data structure is the “sieve” function 
which generates the prime numbers. Details of 
the function can be found in [Weng79], [Hend80]. 


However, it has been noticed that the data- 
driven evaluation of infinite data structures 
tends to usurp system resources unnecessarily 
[Dake82]. Furthermore, without a mechanism for
enforcing a "fair scheduling" policy among the
enabled nodes, the data-driven evaluation may not
be safe for programs which process infinite data
structures [ArPi82] (footnote 11).


9 If choice(n-1) is selected by the RS
node, the recursion does occur.


10 Of course, there may be a formal correct- 
ness proof for the graph. However, we will not 
pursue this topic in this paper. 


11 One way of enforcing the fair scheduling
policy is to limit the number of tokens on each
arc. This approach is adopted in the static data
flow architecture of Dennis' research group
at MIT [Denn80], [DeMi75]. In the MIT static
data flow architecture each arc can contain at
most one token. But enforcing the limitation re-
quires much overhead in the execution time. This
is because, in addition to the (data) tokens, the

On the other hand, the demand-driven evalua-
tion is safe in evaluating the programs which
process the infinite data structures (footnote 12).
However, in addition to the run-time overhead men-
tioned in section 1, there is additional run-time
overhead of time and memory space to evaluate the
programs that process infinite data structures.
This run-time overhead results from generating
the suspensions for structure construction opera-
tors and coercing them for the operators which
use the structures [FrWi76].


The basic mechanism of dealing with the 
infinite structures in the DC1 flow schema is to 
control the infinite structures generating opera- 
tors. An infinite structure may be generated 
either by iteration (i.e., infinite loop) or by 
recursion (i.e., infinite invocation of the 
recursion). Therefore, the iteration operator or
the recursion invocation operator is controlled
(by the DELAY-ENABLE control signal) in the DC1
flow schema to evaluate the infinite structure
processing program.


In order to illustrate the DC1 flow schema
for the programs processing the infinite data
structures, we use the following program
(footnote 13):


integersfrom (m) = cons (m, integersfrom(m+1))
getfirst (k, x)
    = if k = 0
      then NIL
      else cons (car(x),
                 getfirst(k-1, cdr(x)))


Note that the function

getfirst (k, integersfrom(m))     <2>

yields a finite list with k elements, k>0,
although the integersfrom(m) function renders an
infinite list.
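The meaning of the two functions can be illustrated with Python iterators. This sketch relies on Python's own lazy, demand-driven iteration (`itertools.count` and `itertools.islice`), which is not the DC1 mechanism; it only shows that a finite prefix can be taken from an infinite list.

```python
from itertools import count, islice


def integersfrom(m):
    """Infinite stream m, m+1, m+2, ... produced one element at a time."""
    return count(m)


def getfirst(k, x):
    """First k elements of the stream x as a finite list."""
    return list(islice(x, k))
```

For instance, `getfirst(4, integersfrom(3))` yields the four-element list starting at 3.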


A DC1 graph for the function <2> is shown in 
figure 4-1. It consists of a DC1 graph for the 
'integersfrom' function and one for the


status signal (which is called the Acknowledge
signal in Dennis' group) should also be
transmitted between operators and processed. As
a result, a data-driven evaluation which enforces
the fair scheduling may be inefficient.


12 In the demand-driven evaluation, the
evaluation and construction of the infinite data 
structures are delayed and part of the infinite 
structures are evaluated if they are required by 
some other expression(s). Thus, the evaluation 
of the programs processing the infinite data 
structures is possible in the demand-driven 
evaluation as long as the users use a finite part 
of them. 


13 This program is from [Hend80].


'getfirst' function. The FIRST and REST opera-
tors are stream operators, i.e., operators whose
input/output data are streams. The EQTNE opera-
tor generates control signal output as well as
data output. Note that the output control signal
arc of the EQTNE node is connected to the DELAY-
ENABLE control signal input of the APPLY1 node.


The list generated by the ‘integersfrom’ 
function is represented as a stream internally 


14 The functions of the stream operators FIRST
and REST are similar to the LISP functions CAR
and CDR, respectively. (Details of the opera-
tions of the FIRST and REST operators can be
found in [ArTh80], [AGP 78].) The CONS operator
is a stream operator, too. It generates output
data as soon as its simple value input data is
available. The operation of the EQTNE node is
summarized below.


compare the two data inputs;
if equal
then generate "true" as data output;
     generate "NEN" as control signal output
else generate "false" as data output;
     generate "EN" as control signal output


Figure 4-1 A DC1 graph for the function <2>
(initial tokens: m, {integersfrom}, k, {getfirst};
legend: d-e: DELAY-ENABLE; f-e: FORCE-ENABLE)


14 This stream representation allows the con-
current operations in manipulating the list
[Weng79].


The initial token distribution is shown as dots 
in figure 4-1(c). 


The evaluation for the DC1 graph in figure
4-1(c) is as follows. Suppose that we consider
the jth instance (j = 0,1,...) of the recursion
of the 'getfirst' function. The nodes enabled by
the initial tokens are the CONS1 and EQTNE nodes.


(1) Case 1: Assume the two data inputs to the
    EQTNE node are equal. Then,

    (a) The SWITCH node SW1 passes its input data
        onto its output data arc. Thus the output
        of the DC1 graph becomes [] (or est), which
        is a null stream. It is the last element
        of the output stream of the DC1 graph.

    (b) The APPLY1 node absorbs its control sig-
        nal token as well as data tokens because
        the DELAY-ENABLE control signal is NEN.
        This prevents the APPLY1 node from being
        enabled further. As a consequence the
        elements of the infinite list are not
        evaluated any more.

    (c) The three SWITCH nodes SW2, SW3, and SW4
        absorb input data tokens.


(2) Case 2: Assume the two data inputs to the
    EQTNE node are not equal. Then,

    (a) The SWITCH nodes SW2, SW3 and SW4 pass
        their input data onto their output data
        arcs.

    (b) Thus a value (m+j), which is to be the
        (j+1)th element in the output stream,
        goes through the FIRST and CONS nodes.
        It then becomes the output of the DC1
        graph at this jth instance.

    (c) The APPLY2 node is enabled and executed
        so that the (j+1)th recursion of the
        'getfirst' function can occur.

    (d) Also, an EN of the DELAY-ENABLE control
        signal allows the APPLY1 node to cause
        recursion of the 'integersfrom' function.
        As a consequence, the next element of the
        infinite list generated by the
        'integersfrom' function will be available
        in the (j+1)th recursion instance of the
        'getfirst' function.


Note that the infinite list is not evaluated and
constructed at one time by the 'integersfrom'
function. Rather, part of the infinite list is
evaluated incrementally. In fact, one element
of the infinite list is evaluated at each
instance of the recursion of the 'getfirst' func-
tion. Also, the generating and consuming opera-
tions on an element of the list (i.e., stream) are
interleaved so that more concurrency can be
achieved. As is clear from the above descrip-
tion, the DC1 evaluation evaluates (k+1) elements
of the infinite list of integers which starts
from m. The first k elements are passed as out-
puts of the DC1 graph. The last element is
absorbed by the SWITCH node SW3 at the kth
instance of recursion of the 'getfirst' function.
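The count of (k+1) evaluated elements can be checked with a sequential trace of Case 1 and Case 2. This is a simplified sketch, not a simulator of the graph: the function name `dc1_getfirst` and the loop that stands in for the recursion instances are assumptions made for illustration.

```python
def dc1_getfirst(k, m):
    """Trace getfirst(k, integersfrom(m)); returns (output, elements_evaluated)."""
    output = []
    element = m        # CONS1 is enabled by the initial tokens
    evaluated = 1      # so the first element is evaluated immediately
    remaining = k
    while True:
        if remaining == 0:      # Case 1: EQTNE emits "true" and NEN
            # SW3 absorbs `element`; APPLY1 absorbs its tokens and stops
            return output, evaluated
        output.append(element)  # Case 2: element passes FIRST and CONS
        remaining -= 1          # APPLY2: next instance of getfirst
        element += 1            # EN lets APPLY1 invoke integersfrom again
        evaluated += 1
```

With k = 3 and m = 5 the trace outputs three elements and evaluates four, the last of which is discarded.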


5. Concluding Remarks 


In this paper we proposed the DC1 flow
schema, a pragmatic parallel asynchronous compu-
tation model. It contains two kinds of control
signals as well as the data values. It is an 
asynchronous evaluation scheme and it can exploit 
the implicit concurrency of the computation. Its 
sequencing is based on both the control signal 
and the data availability. 


The properties of the DC1 flow schema are 
studied elsewhere [Woo 83]. There it has been 
proved that the DC1 flow schema is determinate. 


The application of the DC1 flow schema to
nondeterministic programs is shown in this
paper. If the (pure) data-driven approach is
used to evaluate the program <1>, all of the n
functions are performed, because the input x
to every function is available. One of the
results of the functions is then randomly
chosen as the result of the program.
Now, let us compare these two flow schemas for
the general nondeterministic program in <1>. In
the DC1 flow schema only one function, which is
chosen by the EN value of the DELAY-ENABLE con-
trol signal, out of the n functions is evaluated.
On the other hand, in the data flow schema, all
of the n functions are evaluated. As a conse-
quence, in general, the DC1 flow schema performs
the nondeterministic programs more efficiently
(using less time and hardware resources) than the
data flow schema. In addition, the DC1 flow
schema has a better convergence property in the
evaluation of the nondeterministic program. The
reason is as follows. The function f in the pro-
gram <1> diverges in the data-driven evaluation
if any one of the n functions diverges. However,
the function f converges in the DC1 evaluation
even though there are some diverging functions
among the n functions, if those diverging func-
tions are not selected by the RS node.


The application of the DC1 flow schema to
the infinite structure processing program is dis-
cussed in this paper. It is shown that the DC1
flow schema is safe in evaluating the programs
which process the infinite data structures.
Therefore, the DC1 evaluation has an advantage over
a data-driven evaluation. Furthermore, the DC1
evaluation does not need any run-time overhead to
evaluate those programs. Thus the DC1 evaluation
has an advantage over the demand-driven evaluation,
since the latter requires the run-time overhead
discussed above.


The technique of controlling the infinite 
structure constructing operator can also be used 
to make the for each - while loop self-cleaning 
[AGP78]. Detailed description about this can be 
found in [Woo 83]. 


It is noted that the computer architecture
for the DC1 flow schema may be similar to the
data flow architecture except for several (minor)
modifications [Woo 83]. Building a simulator for
the DC1 flow schema is planned.


Abstract -- A simple control mechanism for non-
functional data flow languages is described. It
is based on hierarchical structuring of data flow 
programs. The hierarchy is described 
metaphorically using the notion of "execution 
time" for program statements and blocks. It is 
postulated that statements within a block are 
executed "infinitely" faster than hierarchically 
superior statements. Resulting program structure 
corresponds to the intuitive top-down structure of 
the program. A parallel language based on this 
principle would bridge the gap between correct and 
"correctly" structured programs. 


1. Introduction 
The advantages of the top-down structuring of 
programs and the style of programming by stepwise 
refinement hardly need to be reiterated. In 
sequential von Neumann programs, top-down 
structure is more a matter of style: generally 
speaking it may not be required by the semantics 
of the programming language. In this brief paper 
I attempt to define simple semantics for a non- 
functional data flow language. The analysis of 
this problem immediately shows that one or another 
form of hierarchical program structure is 
necessary. Thus, top-down programming is optional 
for sequential programs, but is essential for 
high-level data flow programs. In Section 2 I 
discuss briefly why this is so, and in Section 3 I 
describe a data flow control mechanism based on 
the hierarchical program structure. The hierarchy 
is implemented using the metaphor of "execution 
time." It is postulated that statements within a 
program block are executed "infinitely" faster 
than hierarchically superior statements. A 
parallel language based on this principle would 
bridge the gap between correct and "correctly" 
structured programs. The problems of a plausible 
syntax for such a language are not addressed in 
this paper. A (rather naive) attempt at language 
definition including syntax is described in [7]. 


2. Two Components of Determinism 


For abstract analysis it is convenient to view 
computations as relaxation in a transition system. 
I use the term "relaxation" to refer to a sequence 
of transitions leading from a given state to a 
terminal state. For our purposes I use the
following definition of determinacy: the system
is called deterministic if for any initial state
all relaxations from this state lead to the same
terminal state. Asynchronous data flow
computations are possible due to the determinacy
of computation graphs [10]. In general, in data-
driven systems determinacy depends on two
conditions:

1. Availability of arguments: all designated
   inputs for a function or an activity must be
   available before the computation can
   proceed.

2. Preservation of results: all results must be
   made available to their "users" before they
   are overwritten or erased.

The first condition is the essence of data-driven
computations. There is some freedom, however, in
the choice of a formalism to satisfy the second
"preservation-of-results" condition. In data flow
graphs, for example, it can be done by

(a) providing FIFO queues on all arcs;

(b) allowing a node to fire only when all output
    arcs are vacant;

(c) explicit scheduling or partial scheduling of
    activities and their durations;

(d) setting synchronized "locks" at the entrances
    of computation blocks (iterative loops, in
    particular) so that each arc is enabled
    (carries a value) only once for each cycle
    through the "locks".

Theoretically any one of these mechanisms could be
used as a basis for defining the semantics of a
high-level data flow language. But (a) and (b)
would result in a totally incomprehensible
programming logic, and (c) would contradict the
spirit of asynchronous computations.


Option (d) is successfully implemented in the 
ID [3] and other languages [6]. This approach is 
based on hierarchical structuring of programs. A 
chunk of a program designated for repetitive use 
(e.g. an iterative loop) must be defined formally 
as a block with certain inputs and outputs. The 
inputs are synchronized through a chain of 
synchronized locks, so that only one datum can 
enter through each input arc before all output 
arcs receive the results. 




Painful experiences with software development 
and maintenance stimulated efforts to impose 
severe restrictions on languages. The freedom of 
programmers was curtailed by the elimination of 
global variables, GOTO statements and other 
drastic measures. The most radical approach 
advocates pure functional languages without side 
effects [4]. These efforts have resulted in 
dramatic improvements in programming productivity. 
The impetus of restrictive measures was to force 
some order and structure into conventional von 
Neumann programs. It is not obvious, however, 
that the same strict measures must be applied to 
data flow programs. First, as discussed earlier, 
proper definition of semantics of a data flow 


language may automatically require proper
structuring of programs. Second, attempts at
functional or "value-oriented" programming
languages for data flow computers, such as VAL
[2,9], are not free from difficulties and
compromises, especially with regard to handling
iterations and history-sensitive computations [8].
Third, object-oriented programming offers a viable
alternative to value-oriented programming: its
semantics may be more comprehensible for a
mathematically-naive programmer.

A detailed comparison of functional vs
"procedural" data flow languages is beyond the
scope of this paper. Our objective is merely to
show that a "procedural" data flow language is
possible. Note that pure functional programs are
top-down structured in a natural way -- [5]
provides an example. A comprehensive review of
properties and constraints for data flow languages
can be found in [1].


My intention is to define simple semantics for
a language with as few restrictions as possible.
The language should allow global variables,
partial execution of blocks under certain
conditions, loosening of the single assignment
rule, the same variable on both sides of the
assignment statement, etc. More restrictive
semantics or programming styles can be
superimposed on this basic language. The price
paid for this freedom is the necessity to
introduce the metaphor of stepwise computation
with external timing pulses signalling the
beginning of each cycle.


The task of defining language semantics
includes two subtasks. The first is to represent
data flow graphs in the textual form, the second
is to define a control mechanism. The first
subtask is straightforward. Each node of the data
flow graph is more than simply a node: it
corresponds to a function with some ordered set of
arguments and one or several results. We can
assign names to the arcs of the graph (several
arcs may get the same name). The graph then can
be represented as a set of nodes with labelled
"receptacles" and "transmitters", and then
translated into a list of statements (Figure 1).
That is, of course, where the data flow graphs
originally came from.


The single assignment rule requires that arcs
with the same name must come out of the same node.
This makes connections in which two different
nodes, F1 and F2, both feed an arc named A
illegal.

It may be convenient in some cases to relax this
constraint and permit such connections with the
understanding that nodes F1 and F2 can fire only
under mutually exclusive conditions, but not
simultaneously. This, of course, would place the
additional burden of avoiding conflicts on the
programmer.


The second subtask is to define the control
mechanism. It would be unwise to use conventional
data flow control, which requires that a node can
fire only when all output arcs are free. This
would violate the principle of locality, since
intermediate results can be "consumed" in remote
places and times. Another alternative, as
mentioned earlier, is explicit scheduling. The
control mechanism proposed here is a compromise.
It is based on the metaphor of "execution time"
for statements. It is postulated that each
statement takes "the same time" for execution.
Operations proceed in a stepwise manner and are
synchronized with an imaginary external timing
pulse. All enabled (possessing all arguments)
statements are executed at each cycle, and all the
arguments of executed statements are consumed by
the end of the cycle. Suppose we have an
iterative loop, and computations in the body of
the loop take more than one cycle:


I:=I+1
X:=F1(I)
Y:=F2(X)


How can we synchronize the increment operator
I:=I+1 (which fires at each cycle) with the body
of the loop? To accomplish this we combine the
statements of the body into a program block and
postulate that the "execution time" for the block
is the same as for peer elemental statements:

I:=I+1
[ X:=F1(I)
  Y:=F2(X) ]

The statements within the block then must be
executed faster than outside statements. In fact,
they must operate "as fast as necessary" or
"infinitely" faster to keep pace with
hierarchically superior statements. The
computation within the block is a relaxation from
the initial state -- variables defined at the
beginning of the (external) cycle -- to a terminal
state where all statements in the block are
disabled. The whole process takes exactly one
cycle when viewed from the next superior level of
the hierarchy. Figure 2 illustrates a data flow
program with this control mechanism. For
simplicity, only one level of nested blocks is
used in this example. At the upper level,
statements X:=X+DX, PRINT(X,Y) and the block are
executed at each cycle. At the lower level,
computations within the block take 6 "subcycles".
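The stepwise control mechanism and the relaxation of a block can be sketched as follows. The representation of a statement as a tuple (target, inputs, function) and the names `step`, `relax` and `outer_cycle` are assumptions made for this sketch; the paper itself does not fix a syntax or an implementation.

```python
def step(statements, tokens):
    """One cycle: fire every enabled statement against a snapshot of tokens."""
    enabled = [s for s in statements if all(v in tokens for v in s[1])]
    results = {s[0]: s[2](*(tokens[v] for v in s[1])) for s in enabled}
    consumed = {v for s in enabled for v in s[1]}
    remaining = {k: v for k, v in tokens.items() if k not in consumed}
    remaining.update(results)          # arguments consumed, results produced
    return remaining, len(enabled)


def relax(statements, tokens):
    """Run subcycles until no statement is enabled (the terminal state)."""
    subcycles = 0
    while True:
        tokens, fired = step(statements, tokens)
        if fired == 0:
            return tokens, subcycles
        subcycles += 1


def outer_cycle(i, f1, f2):
    """One upper-level cycle of I:=I+1 together with the block [X:=F1(I) Y:=F2(X)]."""
    block = [("X", ("I",), f1), ("Y", ("X",), f2)]
    inner, subcycles = relax(block, {"I": i})  # block sees I from cycle start
    return i + 1, inner["Y"], subcycles
```

In this sketch the two-statement block needs two subcycles, yet from the upper level it completes within the same cycle as the increment.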


4. Conclusions 


Simple semantics for a data flow language have
been described. The objective was to define
semantics of the data flow language with as few
constraints as possible: to allow global
variables, partial execution of blocks and
relaxation of the single assignment rule. The
control mechanism is based on the "execution time"
hierarchy. It turns out that for a properly
programmed algorithm the hierarchical "execution-
time" structure closely corresponds to the
intuitive top-down structure. Therefore the
proposed control mechanism bridges the gap between
correct and "correctly" structured programming.
Hierarchical structuring also permits structured
debugging. The programmer can monitor program
execution step by step at any given level in the
hierarchy.


(B,C):=F4(A)

Figure 1

"Receptor/transmitter" and textual representations
of directed graphs: (a) arc labelling; (b)
"receptor/transmitter" representation; (c) textual
representation.


PROGRAM RUNGE KUTTA(X0,Y0)

% Initialization section
CONSTANT DX=.1, A=1.

% Execution section
X:=X+DX | X<A
[ Y:=Y+DY
  DY:=(W1+2*W2+2*W3+W4)/6.
  W1:=F(X,Y)*DX
  W2:=F(X+DX/2.,Y+W1/2.)*DX
  W3:=F(X+DX/2.,Y+W2/2.)*DX
  W4:=F(X+DX,Y+W3)*DX ]
PRINT X,Y

% Comments:
This program implements one of the Runge-Kutta
methods for numerical solution of a differential
equation. The initialization section serves to
define constants and initialize variables. The
program is executed according to the "execution
time" hierarchy: all statements within the block
enclosed in brackets are executed "infinitely"
faster than statements outside the block. This
hierarchy is used to synchronize X:=X+DX and
Y:=Y+DY.

Figure 2

A data flow graph and a corresponding program with
the "execution-time" hierarchical structure.
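For comparison, the arithmetic of the program in Figure 2 can be written as a conventional sequential loop. This sketch reproduces only the classical fourth-order Runge-Kutta formulas (W1 to W4 and DY), not the data flow control mechanism; the function name `runge_kutta`, the returned trace, and the use of a precomputed step count instead of the test X<A (to avoid floating-point drift) are assumptions.

```python
def runge_kutta(f, x0, y0, dx=0.1, a=1.0):
    """Integrate dy/dx = f(x, y) from x0 to a; returns the list of (x, y) pairs."""
    steps = round((a - x0) / dx)   # assumed stand-in for the condition X < A
    x, y = x0, y0
    trace = [(x, y)]
    for i in range(steps):
        w1 = f(x, y) * dx
        w2 = f(x + dx / 2, y + w1 / 2) * dx
        w3 = f(x + dx / 2, y + w2 / 2) * dx
        w4 = f(x + dx, y + w3) * dx
        y = y + (w1 + 2 * w2 + 2 * w3 + w4) / 6   # Y := Y + DY
        x = x0 + (i + 1) * dx                     # X := X + DX
        trace.append((x, y))
    return trace
```

With f(x, y) = y and the starting point (0, 1), the final y approximates e closely for the step size DX = 0.1 used in the figure.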


A PIPELINE MACHINE FOR IMAGE PROCESSING APPLICATIONS

Abstract -- The design of a general purpose image
processing machine is currently one of the most
active research areas. This paper introduces a
pipeline machine designed for such applications.
The architecture of the machine can be described
as a simplified form of a data flow machine. This
new design combines the advantages of the fast
throughput needed for image processing applications
with a simple and easy-to-implement instruction
set. To achieve these normally mutually exclusive
features, the machine is based on a pipeline archi-
tecture which is configurable at execution time.
This allows the user to include only the hardware
components which are needed for a given application.
The instruction fields are selected accordingly, so
the user is only concerned with the fields of the
components he has included. The design is optimi-
zed for array or vector manipulation. The paper
begins with a description of some of the image
processing algorithms. The hardware components
needed to implement each algorithm are combined to
form a basic data processing module. A brief des-
cription of the microcode used in the machine is
given. The paper concludes with a summary of the
important features of this new design.


Introduction 


Recently, there has been an intensified effort
toward the design of general purpose image process-
ing machines. A survey of these efforts is given
in reference [1]. Most of the existing machines
are similar in that they are based on fast
multipliers and rely on pipeline processing to
achieve high throughput. A major disadvantage of
these machines is the complicated programming pro-
cedures that are normally needed to execute the
desired functions. The following sections introduce
a signal processing machine that combines the
advantages of the fast throughput needed for image
processing applications with a simple and easy-to-
implement instruction set.


Since the machine is tailored specifically for
image processing applications, the data is
represented using a block floating point format. This
format provides sufficient accuracy for most
image applications, while allowing fast and simple
data manipulation. The paper begins with a survey
of the various image processing functions. Based
on these functions, the necessary hardware elements
are defined. The microcode used in the machine is
introduced, and some of the basic features of the
machine are discussed briefly.


Survey of Image Processing Functions 


This section surveys the various functions 
needed in any image processing machine. For a 
better understanding of these functions, the reader 
should consult references [2], [3], and [4]. 
Digital images are large arrays of data. Pro- 
cessing these images requires the implementation 
of functions which can be classified into one of 
the following groups: 

1) Simple functions of a single variable. 

2) Simple functions of two variables. 

3) Compound functions of a single variable. 

4) Compound functions of multiple variables. 

Each group is discussed in turn below. 


Group 1, Simple Functions of a Single Variable 


These functions are defined as follows: 
Given an input array F, the output array G is 
given by 


G = T(F), (1) 


where T is any linear or nonlinear mapping 
function. The mapping function is implemented 
through a Look-Up Table (LUT), stored in a Progra- 
mmable Read-Only Memory (PROM), or down loaded into 
a Random Access Memory (RAM). 
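
As an illustration of equation (1), the following is a minimal Python sketch of a look-up-table mapping. It is not taken from the paper: the function names and the contrast-stretch limits (64 and 191) are assumptions chosen only to make the example concrete, and an 8-bit pixel range is assumed.

```python
def build_lut(mapping, levels=256):
    """Tabulate a mapping function T once for every possible pixel value."""
    return [mapping(v) for v in range(levels)]


def apply_lut(lut, image):
    """Compute G = T(F) by using each pixel of F as an address into the table."""
    return [[lut[pixel] for pixel in row] for row in image]


def stretch(v, lo=64, hi=191):
    """Hypothetical contrast stretch: map the range lo..hi onto 0..255."""
    v = min(max(v, lo), hi)            # clamp to the stretched range
    return (v - lo) * 255 // (hi - lo)


lut = build_lut(stretch)
F = [[0, 64, 128], [191, 255, 100]]
G = apply_lut(lut, F)
print(G)
```

Once the table is built, the cost per pixel is a single memory read regardless of how complicated T is, which is why the table can live in a PROM or be downloaded into a RAM.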


Some of the single variable functions used in 
image processing are intensity mapping, contrast 
enhancement, and pseudo coloring. 


Group 2, Simple Functions of Two Variables 


The functions of interest in this group are 
defined as follows: Given two input arrays F1 and 
F2, the output array G is given by 


G = O(F1,F2), (2) 


where O is any dyadic operator. The most important 
of these operators are the logical and arithmetic 
functions. An Arithmetic Logic Unit (ALU) is 
needed to implement the functions: Add, Subtract, 
OR, AND, and Exclusive-OR. Multiplication is 
implemented using special hardware, such as the 
TRW Multiplier, to achieve the required speed. 
Division is implemented using look-up tables. 


The functions of two variables are used to 
compare two images, to modify one image according 
to a spatial function described by another image, 
or in general to combine two input images into one 
output image. 


Group 3, Compound Functions of a Single Variable 


One important function that belongs to this 
group is the histogram calculation. The histogram 
is a function showing, for each possible value of 
an input array F, the number of elements that have 
this value. To calculate the histogram, the array 
F is stored in a RAM and used to address the 
cumulative count in another RAM. Every time an 
element is addressed it is incremented by one, and 
the incremented value replaces the original value. 
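
The read-increment-write cycle described above can be sketched in a few lines of Python. This is only an illustration: the list `counts` stands in for the second RAM, and the function name and the four-level test image are assumptions.

```python
def histogram(image, levels=256):
    """Count how many elements of the input array take each possible value."""
    counts = [0] * levels              # the second RAM, cleared before use
    for row in image:
        for pixel in row:
            # The pixel value addresses the count; the incremented
            # value replaces the original one.
            counts[pixel] += 1
    return counts


F = [[0, 1, 1], [3, 1, 0]]
H = histogram(F, levels=4)
print(H)
```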


The histogram is used to determine the proper 
mapping function for contrast enhancement. It is 
also used in pattern recognition. 


Group 4, Compound Functions of Multiple Variables 


Image processing functions that belong to this 
group are normally compute bound. An efficient 
implementation of these algorithms is a measure of 
the performance of any image processing machine. 
Some of the important functions that belong to this 
group are: 


Sum of an Array. Given an input array f1, ..., fn, 
the output g is defined as 


g = f1 + f2 + ... + fn. (3) 


This function is implemented using an ALU with 
Accumulator. It is used to minify an input image 
by averaging. Also, it is used to average more 
than one image for noise cleaning. 


Sum of Products. Given a set of weighting 
coefficients a1, ..., an, the output g is defined 
as 


g = a1*f1 + a2*f2 + ... + an*fn. (4) 


This function is implemented using a Multiplier/ 
Accumulator Unit. The sum of products is used in 
convolution, recursive filter, and interpolation. 
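
Equation (4) and its use in filtering can be sketched as follows. This is a minimal Python illustration, not the machine's implementation; the function names are assumptions. Note that the sliding window below does not reverse the coefficients, so it equals convolution only for symmetric coefficient sets such as the one used here.

```python
def sum_of_products(coeffs, samples):
    """Equation (4): one multiply-accumulate step per coefficient."""
    acc = 0
    for a, f in zip(coeffs, samples):
        acc += a * f
    return acc


def sliding_sum_of_products(signal, coeffs):
    """Apply equation (4) at every position where the window fits."""
    n = len(coeffs)
    return [sum_of_products(coeffs, signal[i:i + n])
            for i in range(len(signal) - n + 1)]


g = sum_of_products([1, 2, 1], [1, 2, 3])
filtered = sliding_sum_of_products([1, 2, 3, 4, 5], [1, 2, 1])
print(g, filtered)
```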


Data Processing Module 


Combining the hardware components needed for 
the various image processing functions produces 
the data processing module shown in Figure 1. The 
machine consists of the following elements: 

1) Two large Random Access Memory units, RAMA and 
RAMB, used to store the processed data and the 
coefficients. 

2) A small register file, RAMF, for temporary 
storage. 

3) A Read Only Memory, PLUT, that stores the inverse, 
sine, cosine, and square root tables. 

4) Two Arithmetic Logic Units, ALUA and ALUB. In 
addition to the normal ALU functions, these two 
units have the capability to accumulate the sum 
of an array. 

5) A Multiplier/Accumulator Unit, MAC. 

6) A Scaler Unit that reduces a 35-bit multiplier 
output into a 16-bit number. 

7) A third Arithmetic Logic Unit, ALUC. This unit 
has a divide-by-two capability. 


The connections between the various units in 
this machine are determined by the 4-way multiplex- 
ers at the inputs of RAMA, RAMB, RAMF, Multiplier, 
and ALUC. As an example, if this machine is used 
to multiply two images, the data path is configured 
such that the outputs of RAMA and RAMB are connect- 
ed to the multiplier input. The output of the 
Scaler is connected to either RAMA or RAMB. In 
this application, RAMF, PLUT, and the three ALUs 
are not included in the data path. If in another 
application the machine is used to add two images, 
the multiplexers are controlled such that the out- 
puts of RAMA and RAMB are connected to ALUC and 
the output of ALUC is written back to either RAMA 
or RAMB. Other applications may require more than 
one configuration of the machine. In these cases, 
all the processing that belongs to one machine 
configuration is finished before the machine is 
reconfigured. 


Introduction to Microcode 


The instruction used in this machine is 
divided into the following eight fields: 


Field   Function 
  1     Sequence controller. 
  2     Address generator control. 
  3     Read RAMA, RAMB, RAMF, and PLUT. 
  4     ALUA and ALUB operations. 
  5     Multiplier operation. 
  6     Scaler operation. 
  7     ALUC operation. 
  8     Write RAMA, RAMB, and RAMF. 

(The relative delay column of this table is illegible in this copy.) 


Table 1. Machine Instruction Fields. 


Figure 1. Data Processing Module. 


Field 1 determines the sequence of instructions 
being executed. It controls the increment, repeat, 
or branching of the instruction sequence. Fields 
2 through 8 control the address generation and 
processing of data. The elements of the instruc- 
tion resemble those of any microcoded machine. The 
most important difference between this design and 
many existing machines is the time delays shown in 
the right column of the table. Executing a single 
instruction is not done in one machine cycle. It 
is spread over many cycles with the various fields 
being executed at the relative delays given in the 
above table. These delays allow the instruction to 
follow the data it has originally created, through 
the read command, until this data is written into 
RAM. The ability of ALUA, ALUB, and the MAC unit 
to accumulate data allows the user to store the 
intermediate results of a summation in the machine 
registers, thus reducing the need to access the 
RAM units during processing. This feature is 
important in many image processing applications 
such as convolution and edge detection. It should 
be noted that the machine is a special case of the 
data flow machine, in which the data is always 
synchronized with the instruction, and all the 
data processing elements operate at a fixed rate. 
The general case of the data flow machine is 
discussed in reference [5]. 


Also, because of the pipelined nature of the 
machine, a new instruction can be started every 
machine cycle. After the initial time needed to 
fill the data pipe, all the included components of 
the machine will be simultaneously processing 
different data elements. The overhead needed to 
fill the pipe is usually a very small percentage of 
the total processing time. At execution time, the 
instruction is modified to resemble the hardware 
configuration being implemented. This modification 
is achieved through a set-up procedure that is 
executed once at the beginning of every new config- 
uration. As an example, if the machine is config- 
ured for convolution, the set-up procedure controls 
the instruction flow such that only the fields 
related to convolution are included. These fields 
are the following: 


Field   Function 
  1     Sequence controller. 
  2     Address generator control. 
  3     Read RAMA and RAMB. 
  5     Multiplier operation. 
  6     Scaler operation. 
  8     Write RAMA or RAMB. 


Table 2. Instruction Fields for Convolution. 


In this case, the relative delay is determined only 
by the components being included in the configura- 
tion. 


The microcode needed for convolution consists of 
three instructions: the first is a Read and Multiply 
instruction; the second is a Read, Multiply, and Acc- 
umulate instruction; the third is a Read, Multiply, 
Accumulate, and Write instruction. The most impor- 
tant feature of this code is its compactness. This 
feature simplifies both the coding and the debugg- 
ing effort. The compactness of the code is a result 
of the data flow nature of the machine which mini- 
mizes the interaction between various instructions. 
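
The three-instruction sequence can be mimicked in software to show how one output point is produced. The Python sketch below is only an illustration under stated assumptions: the instruction names in the trace, the function name, and the requirement of at least two coefficients are all hypothetical, and the pipelining and relative field delays of the real machine are not modeled.

```python
def run_convolution_microcode(samples, coeffs):
    """Produce one output point and a trace of the instructions issued."""
    if len(coeffs) < 2 or len(samples) != len(coeffs):
        raise ValueError("need matching lists of at least two elements")
    trace = []
    acc = 0
    result = None
    last = len(coeffs) - 1
    for i, (f, a) in enumerate(zip(samples, coeffs)):
        product = f * a                      # read both operands, multiply
        if i == 0:
            acc = product                    # first instruction starts the sum
            trace.append("READ, MULTIPLY")
        elif i < last:
            acc += product                   # second instruction, repeated
            trace.append("READ, MULTIPLY, ACCUMULATE")
        else:
            acc += product                   # third instruction ends the sum
            result = acc                     # and writes it back
            trace.append("READ, MULTIPLY, ACCUMULATE, WRITE")
    return result, trace


result, trace = run_convolution_microcode([1, 2, 3], [4, 5, 6])
print(result)
for line in trace:
    print(line)
```

The intermediate sum stays in the accumulator between instructions, so memory is written only once per output point.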


Conclusion 


In this paper, a general purpose image process- 
ing machine was introduced. The machine is based 
on a pipelined architecture with the various 
machine elements connected through multiplexers. 
This structure offers independent data paths, thus 
preventing the bus contention problem. In addition, 
these multiplexers are used to configure the mach- 
ine at execution time. Therefore only the hardware 
needed for a given application is included, and 
the data follow the shortest possible path. The 
instruction fields are modified accordingly. These 
fields control the processing sequence starting 
with reading the data; then modifying it using the 
ALUs, the Multiplier, and the Scaler units; and 
finally writing it back. The structure can be 
described as a synchronous data flow machine. 


These features produce a machine that is as 
fast as any specialized image processing machine, 
while maintaining the flexibility of a general 
purpose array processor. In addition, the machine 
is capable of handling many image processing appli- 
cations in both the spatial and the frequency 
domains. Another advantage of the machine is its 
simple and compact code, which makes the machine 
easy to program. 


AN EVALUATION STUDY OF SIX TOPOLOGIES 
OF PARALLEL COMPUTER ARCHITECTURES 
FOR SCENE MATCHING 

ABSTRACT 


Six topologies of parallel computer architectures were 
evaluated for implementation of scene matching. These 
topologies are: (1) cluster-connected, (2) mesh-connected, 
(3) cluster-cluster-connected, (4) cluster-mesh-connected, 
(5) mesh-cluster-connected, and (6) mesh-mesh-connected. 
Analytic results indicate that the mesh-connected 
architecture is the best candidate. However, under the 
constraint of the physical pin-out number, the mesh-mesh- 
connected architecture is a good alternative. 


1. Parallel Computer Architectures 


During the past twenty years, a significant number of 
high speed computer architectures have been proposed. Some 
of these architectures that were thought to be impractical 
to build are now considered not only viable but also 
necessary for speeding up computation. This is made possible 
by the advancement of solid state technology. Signal/image 
processing requires a large number of computations. This is 
particularly true for scene matching algorithms. Scene 
matching is the process of locating a subimage in a sensed 
image using a template [Hall,79]. Scene matching algorithms 
have potential applications in cruise missile guidance 
[Berry,80]. A limitation to their use is their intense 
computational requirements. One advantage that all 
scene matching algorithms share is that independence of 
computations exists in the algorithms. This independence 
offers opportunities for concurrency and thus favors the use 
of innovative parallel computer architectures for speeding 
up the processing. The tradeoff study discussed in the 
following shows the advantages and the 
disadvantages of six promising computer architectures with 
respect to the implementation of a scene matching algorithm. 
The analysis parallels the technique used by Feather et 
al. [Feather,80]. 


2. Six Promising Computer Architectures 


Six types of architectures were analysed, namely, (i) a 
cluster-connected architecture (CCA), (ii) a mesh-connected 
architecture (MCA), (iii) a cluster-cluster-connected 
architecture (CCCA), (iv) a cluster-mesh-connected 
architecture (CMCA), (v) a mesh-cluster-connected 
architecture (MCCA), and (vi) a mesh-mesh-connected 
architecture (MMCA). Schematic diagrams of these six 
architectures can be seen in Figure 1(a)-1(f). 


2.1. Analysis of the CCA 


A CCA (Cluster-Connected Architecture) is an 
architecture having a common bus. Each processing element 
(PE) has its own local memory. The PEs are controlled by a 
central controller. In the following analysis, we will 
demonstrate the limitation of this approach. 


Let the sensed image be of size 2^L x 2^L while the 
template is of size 2^m x 2^m. Let the number of processing 
elements (PEs) be 2^k x 2^k. Let the image be partitioned such 
that each PE occupies a subimage of size 2^(L-k) x 2^(L-k). In 
the following analysis, it is assumed that the sensed image 
is surrounded by a boundary of width 27" ",. Hence no special 
processing is required by PEs located on the boundary 
partitions of a sensed image. It is assumed that the 
template information has already been stored in the local 
memory of each PE. Therefore, no time is required to 
transmit template information to individual PEs. In order 
for each PE to perform a scene matching algorithm, it has to 
acquire a subimage of size (zee Ry yom) of which 52te ks is 


already in its local memory. 


It can be shown that the total time required to obtain 
the best match is: 


2k (L—-k+m+1) ,52m 


Tecaleeke-mat ¢  C2 (2 ) 


2(L-k+m) 52k 


+ 
tbe 27°) 


This work was supported by the United States Department of 
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the Naval Air Development Center. 
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assumed that the- 
Figure 1 Various architectures for analysis. 
(a) Cluster-Connected Architecture, 
(b) Mesh-Connected Architecture, 
(c) Cluster-Cluster-Connected Architecture, 
(d) Cluster-Mesh-Connected Architecture, 
(e) Mesh-Cluster-Connected Architecture and 
(f) Mesh-Mesh-Connected Architecture. 


Suppose the system is implemented using current 
technology; typical values for the two timing parameters 
of the model are, respectively, 
0.4 microseconds and 2 microseconds. 


Let us define the speed up factor S as the ratio of the 
time to perform the computation sequentially (TS) over the 
time to perform it in parallel. Then: 


S_CCA(L,k,m) = TS / T_CCA(L,k,m) 

Using the above timing values, the variation of 
log10 S_CCA is plotted versus k with L=70 and m=5 (see Figure 
2). On the same figure, the speed up S per PE, which is 
defined as the PE utilization factor F, is also plotted. 
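
The two figures of merit defined above can be computed with a short Python sketch. It is an illustration only: the function names and the times in the example are made up, and the PE count is assumed to be 2^k x 2^k.

```python
def speedup(t_sequential, t_parallel):
    """S: sequential time divided by parallel time."""
    return t_sequential / t_parallel


def utilization(s, k):
    """F: speed up per PE, assuming an array of 2^k x 2^k PEs."""
    return s / (2 ** k) ** 2


# Hypothetical times, chosen only to exercise the definitions.
s = speedup(1024.0, 16.0)
f = utilization(s, 3)
print(s, f)
```

A utilization factor of 1.0 means the speed up equals the number of PEs. Any alignment or transfer overhead in the parallel time pushes F below that value.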


The results can be interpreted physically as follows. 
When k is small, that is, when the number of PEs is small, 
the time spent is mostly on computation. When k increases, 
S will increase because of the decrease in computation 
required by each PE. However, when k reaches a certain 
value, the time required to align the data becomes dominant 
and hence causes S to decrease. The implication of this 
observation is important. Suppose the size of a template is 
large; one might want to increase the number of PEs hoping 
to decrease the computation time. The computation time 
decreases. However, the overhead is so dominating that the 
net effect will degrade the performance of the whole system. 
This is very crucial because the system is not extendible to 
meet unpredictable needs. Furthermore, on the same figure, 
one sees that the PE utilization factor F decreases 
significantly after k=6. This indicates that the PEs are 
not utilized efficiently. 


2.2. Analysis of the MCA 


The proposal for this kind of architecture was first 
given by Unger [Unger,54]. Over the past twenty nine years, 
surprisingly, only a few computer systems having this 
architecture were actually built. The main reason for such 
a delay can be attributed to the complexity and cost 
involved in building any mesh-connected computer system. 
Representatives are ILLIAC IV [Barnes,78], DAP by ICL 
[Hunt,79], CLIPP 4 [Preston,79], and MPP [Batcher,80]. 
Among them, the DAP, the CLIPP 4, and the MPP are recently 
built machines. 

Figure 2 Performance analysis of Cluster-Connected 
Architecture. 

A similar analysis shows that the time required to 
locate the best match in this architecture is: 
w~cob7ktm—-1,52m 
Twca lh, km) =(2 +2 +2k)t ee 
tg 
SMCA>T 
MCA 
Using the same timing values, the variation of 
S_MCA and F_MCA with k can be seen in Figure 3. It can be 
seen that log10 S_MCA increases linearly with k while the PE 
utilization factor F_MCA remains relatively constant from k=0 to 
k=8. When k>8, F_MCA decreases because of the overhead in 
aligning the data. 

In the following, we will analyze four derivatives of 
the above architectures. 


2.3. Analysis of the CCCA 


A schematic diagram of the CCCA (Cluster-Cluster- 
Connected Architecture) can be seen in Figure 1(c). In this 
architecture, it is assumed that there are two levels of 
clusters. Each cluster at the first level has 2^p x 2^p PEs 
while each cluster at the second level has 2^q x 2^q PEs. The 
advantage of this interconnection is that the communication 
load at the first level is reduced. A representative of 
this architecture is the Cm* [Swan,77]. 

Using similar analyses, it can be shown that the time 
to complete the scene matching algorithm is: 
| t cgttmtpt 1 s2mrep | 
l xfr | 
Teceatbeprarms|4Zn TBP *GY 422m" 284277 4229) | 
ial ee ae a) | 
| | 
And hence the speed up S: 

S_CCCA(L,p,q,m) = TS / T_CCCA(L,p,q,m) 

S_CCCA varies with both p and q. We assume that 
for a given p, q can be chosen to maximize S_CCCA. With this 
assumption, the variation of S_CCCA and the PE utilization 
factor F_CCCA with respect to p can be seen in Figure 4. It 
can be seen that S_CCCA has a similar trend to S_CCA. 

The variation of F_CCCA with respect to p is more 
difficult to interpret physically. However, one can 
conclude that when p>2, there is no good way to utilize the 
PEs efficiently using this architecture. In the above 
figure, the analysis is valid up to p=9 because when p=10, 
it is not meaningful to have a second level of PEs. 

Figure 3 Performance analysis of the Mesh-Connected 
Architecture. 

Figure 4 Performance analysis of the Cluster-Cluster- 
Connected Architecture. 


2.4. Analysis of the CMCA 


One may think that by changing the second level of 
architecture to an MCA a better performance may be 
attained. The following analysis shows that this 
expectation is completely incorrect. By repeating the above 
analysis, one obtains: 
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with increasing p and the PE 


very low. Therefore, this rearrangement 
one. 

2.5. Analysis of the MCCA 


It is interesting to see the effect of interchanging 
the architectures in the two levels, i.e. having the MCA at 
the first level and a CCA at the second level. A schematic 
diagram of this topology can be seen in Figure 1(e). With 
similar analysis, one can deduce that: 
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Connected Architecture. 


to maximize the speed up S. Then the variation of S versus 


p can be seen in Figure 6. 


It is surprising to see that by interchanging the 
architectures of the two levels, a significant change can be 
observed in performance. It can be seen that S_MCCA varies 
linearly with p. It has a different slope than S_MCA. 
For some combinations of p and q, F_MCCA is higher than 
that of other combinations of p and q. This fact can be 
used in the actual design of an MCCA machine to fine tune its 
performance. 


2.6. Analysis of the MMCA 


The final architecture that will be analyzed is the 
MMCA (Mesh-Mesh-Connected Architecture). A schematic 
diagram of this architecture is shown in Figure 1(f). The 
S_MMCA can be shown to be: 
s = 's 
MMCA 
OS aa ial i a l 
ers | 
[+2°+2pe2qret (20 Crem P~ 2) 42542q) | 
The variation of log10 S_MMCA and F_MMCA with p can be 
seen in Figure 7. It can be seen that the speed up is 
relatively constant. This can be attributed to the fact 
that at the second level, a lot of concurrency is already 
made use of. In order to maximize S at the second level, 
each PE at the second level has only one pixel to perform 
the scene matching algorithm. The peak of F can be used to 
fine tune the system for maximum utilization. It can be 
seen from the figure that the peak occurs when p=7. In 
other words, q has to be equal to 3. This result has an 
interesting implication and application which will be 
discussed in the following. 


3. Discussion 

It is known that with current VLSI technology one of 
the major problems is the packaging problem. The pin-out- 
number problem forces the designer to limit the amount of 
input information to a chip at any specific time. 
Therefore, one of the solutions is to partition a design 
such that a maximum amount of functionality can be 
implemented on the chip without increasing inter-chip 
communication. Suppose an MCA machine is to be built; it is 
obvious that one cannot put 1024x1024 PEs onto a single 
chip. Even if IC technology permits it, the packaging 
technology will impose severe limitations on the way 
communications are to be made with the outside world. 
However, by breaking an MCA machine into two levels, as either 
an MCCA or an MMCA topology, the implementation problem 
becomes more feasible. For instance, using the MMCA with 
q=3, i.e. by putting 8x8 PEs on one chip and by 
interconnecting 128x128 such chips to form an MMCA machine, 
the overall performance is comparable with a strict MCA 
machine. 
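
The arithmetic behind this partition can be checked with a short Python sketch. It is an illustration only: the function name and dictionary keys are hypothetical, and the off-chip link count assumes a four-neighbor mesh with one link per boundary PE per chip side, which the text does not state.

```python
def partition(total_exp, q):
    """Split a 2^total_exp x 2^total_exp PE array into chips of 2^q x 2^q PEs."""
    p = total_exp - q
    side = 2 ** q
    return {
        "pes_per_chip_side": side,            # 2^q
        "chips_per_side": 2 ** p,             # 2^p
        "total_pes_per_side": 2 ** (p + q),   # 2^(p+q)
        # Assumption: four-neighbor mesh, one off-chip link for each
        # boundary PE on each of the four chip sides.
        "off_chip_links": 4 * side,
    }


layout = partition(total_exp=10, q=3)
print(layout)
```

With q=3 the sketch gives the 8x8 PEs per chip and 128x128 chips quoted above, for 1024x1024 PEs in total. Under the stated link assumption, each chip holds 64 PEs but needs only 32 off-chip links.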

From the above analysis, we can conclude that the MCA 
is the best candidate for implementing scene matching. The 


second choice will then be either a MMCA or a MCCA topology. 


Abstract -- Fractal surfaces have been shown to be 
a useful model for generating images of terrain in 
computer graphics. Unfortunately, the generation 
of fractal images is very costly in CPU time. A 
multi-processor architecture is described which 
takes advantage of the parallelism inherent in 
fractals to speed the generation of images. The 
performance of the processing array is analyzed 
along with the suitability of implementation in 
VLSI. 


Introduction 


Modeling Objects in Computer Graphics 


One challenging problem in computer graphics is 
the generation of images which closely resemble 
objects in the real world. Images are acceptable 
if they are readily recognizable and are not 
obviously synthesized, that is if they might be 
objects in the real world. Many approaches, 
ranging from describing objects by a collection of 
planar polygons to describing them by Bezier 
[Bezi74] or B-spline [Gord74] surface patches, have 
been used. Although these techniques have proved 
adequate for modeling artificial objects, most 
natural objects (such as clouds, terrain, 
mountains, and the like) have few regular features 
and no simple detail. The results of using these 
techniques are usually "obviously synthetic," and 
are not sufficiently realistic to be 
indistinguishable from a scene in nature. 


Other techniques have been proposed for 
modeling natural objects which partially avoid this 
problem. Objects are sometimes modeled by a 
collection of polygons with some texture mapped 
upon them. This technique suffers from the fact 
that the texture on the polygons tends to have a 
great deal of regularity, and thus objects appear 
artificial. 


Another approach is to use sufficient detail in 
the model to make the scene appear realistic with 
no added texture needed. Although this will result 
in realistic scenes, it requires very large 
databases containing sufficient data to display a 
realistic scene from any viewpoint at the maximum 
desired level of resolution. 

All of these techniques for modeling natural 
objects have a similar drawback: they define an 
object at a single, pre-determined level of detail. 
Little computational advantage is gained from 
displaying the scene as seen from afar, as the 
detail needed to display a close-in view must be 
stored regardless of the detail desired in the 
displayed view. In addition, these techniques do 
not allow going arbitrarily close to an object, 
since views which require more detail than the 
model contains will present the same problems as 
with the previous techniques in which the model had 
little detail. 


A common feature of most computer models of 
objects, and one of the significant drawbacks of 
the techniques previously described, is that the 
models are defined in terms of deterministic 
functions. There have been a few exceptions, 
however. Mezei et al. [Meze74] generated textures 
and shapes by using random techniques, and Blinn 
[Blin77] has improved shading methods by using a 
model based upon probabilistic assumptions. 


Stochastic Models 


A more general technique for the definition of 
natural objects is to use a random, or stochastic, 
process to generate detail on surfaces [Four80, 
Four82]. This idea is inherent in the concept of a 
fractal surface [Mand77, Mand82]. The concept of a 
fractal surface has been shown to be an accurate 
model for many processes in nature [Four80, 
Four82]. In particular, terrain and weather 
phenomena may be represented by a stochastic, 
fractal process. Images synthesized using these 
techniques are impressive for their resemblance to 
the real world they purport to represent. 


In this method of modeling surfaces, a basic 
model is constructed using traditional techniques, 
such as defining a surface in terms of polygons. 
The stochastic modeling method adds detail to this 
basic surface definition by recursively breaking 
the polygons into smaller, but slightly 
non-coplanar, polygons. This method of modeling 
surfaces has several advantages over traditional 
techniques. 

First, the definition of an object is simple. 
Complex terrain may often be modeled by only a few 
dozen polygons. The realism is added to this 
definition by the stochastic process. 

Second, the object is not defined at a 


pre-determined level of resolution. If additional 
resolution is needed for a particular scene, it may 
easily be generated by continuing the recursive 
texture generation algorithm to a greater depth. 
Thus, moving the viewpoint closer to an object or 
further away from it is easily handled. Because of 
this ability to generate texture to whatever level 
of resolution desired, this technique has the 
significant advantage of requiring computational 
effort proportional to the complexity of the image 
on the screen. It is not necessary to compute or 
store texture which is not needed for the 
resolution desired. 


Unfortunately, even though the effort required 
is proportional to the on-screen image complexity, 
the generation of a fractal surface is still very 
expensive, with scenes of moderate complexity 
requiring between 30 minutes and several hours of 
CPU time for a single frame. 


Because of the complexity of generating images, 
fractal surfaces cannot now be used in applications 
which require real-time generation of images. 
Although their use in simulators would make the 
trainee's view of his "world" more realistic, the 
computational effort required would be prohibitive. 


Even in non-real-time applications, such as the 
generation of motion pictures, the use of fractal 
surfaces is limited. To generate motion pictures 
using these techniques is, for the most part, 
infeasible, due to the cost of generating thousands 
of such frames. A few organizations do have the 
necessary computational power to generate movies 
using these techniques, and the results have been 
impressive, for example the movie "Star Trek II", 
but the use of these techniques remains severely 
limited. 


Architectures for Graphics Computation 


To make possible the real-time use of fractal 
surfaces, or the large scale generation of motion 
pictures using fractal generated terrain, it 
clearly is necessary to find a method of generating 
fractal surfaces more quickly than is now possible. 
One possible approach to this problem is to find 
multi-processor architectures which are suited to 
the efficient generation of fractal surfaces. 


We describe a new multi-processor architecture, 
suitable for implementation in VLSI, which is 
capable of generating fractal surfaces much more 
rapidly than existing software fractal systems. 
Preliminary estimates indicate a reduction in the 
time required to generate a frame by at least two 
orders of magnitude, and perhaps a sufficient 
reduction to allow the generation of fractal images 
in real-time. 


Although a number of experiments with 
multi-processor based systems for computer graphics 
have been conducted, none have investigated their 
use in generating fractal textures for surfaces. 
Moreover, the architectures which have been 
proposed have been based upon rectangular grids. 
This is reasonable for most applications in 
graphics, since objects tend to be defined in 
rectangular coordinate systems, and since the 
eventual display is, in almost all cases, a 
rectangular screen. 


The generation of fractal surfaces is most 
suited, however, not to rectangular grids, but to 
hexagonal ones. This is primarily due to the way 
in which a fractalization algorithm decomposes a 
surface (which we will define in terms of 
triangles) to generate the fractal surface. The 
proposed architecture arranges processing elements 
on a hexagonal array, with interprocess 
communication between adjacent processors and 
across certain diagonals. The interconnection 
network, although complex, does describe a planar 
graph. Thus the architecture may be implemented 
easily in VLSI. 

The regularity of the processor connections 
also gives the advantage of easy expandability. A 
fractal processor may be increased in power and 
speed by interconnecting several processing units 
in the appropriate manner. Thus the computational 
power of a fractal processor may easily be tailored 
to the application. 


In the next section, we will describe fractal 
surfaces and the techniques used for generating 
them. We then will show how a multi-processor 
architecture can be developed which takes advantage 
of the regularity of fractal surfaces to generate 
fractal surfaces efficiently. Finally, we will 
examine some of the performance issues associated 
with this multi-processor architecture, and discuss 
some of the preliminary results of our performance 
studies. 


Fractal Surfaces 


Generation of Fractal Surfaces 


The generation of a fractal surface usually 
starts with a coarse description of the object in 
terms of triangular polygons*. Only the general 


shape of an object usually needs to be specified, 
since all detail is added by the texturing 
algorithm. 

The triangular definition of the object is
broken into smaller triangles which are slightly
non-coplanar. These smaller triangles are then
broken into even smaller triangles, with the
subdivision process continuing until the triangles
generated are no larger than a pixel in size. At
that point, the pixel-size triangles may be painted
onto the screen. The color selection for each
pixel is based upon the color of the original
triangle and its orientation using conventional
techniques.


The processing of an object definition is most 
easily described in terms of a recursive procedure: 


* Although surfaces intended for fractalization may 
be described by quadrilaterals as well, we will 
restrict our inquiry to triangular definitions. 


procedure fractalize( triangle );
begin
    if the triangle is less than a pixel in size
    then paint the triangle onto the screen
    else
    begin
        for each edge of the triangle do
        begin
            find the mid-point of the edge
            pick a point a random distance from the
                edge midpoint in the direction of
                the normal of the triangle*
        end
        connect the three displaced midpoints
        connect each displaced midpoint with the
            vertices of the original triangle
            adjacent to it
        fractalize each of the resulting triangles
    end
end

begin
    for every triangle t in the object definition do
        fractalize( t )
end

This procedure recursively subdivides each
triangle in the object definition into smaller
triangles, all approximately similar, but slightly
non-coplanar. Thus the procedure is replacing the
object definition with one which is progressively
more textured.


Figure 1 
Subdivision of a triangle 
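

As a rough illustration, the recursive procedure above can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the names fractalize, roughness and pixel are hypothetical, "painting" is modelled by collecting the leaf triangles in a list, a triangle is treated as pixel-sized when its longest side is at most one pixel, and the random displacement is scaled by the size of the triangle being split.

```python
import random


def normal(a, b, c):
    """Unit normal of triangle abc, or the zero vector if it is degenerate."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    n = (u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0])
    length = sum(x * x for x in n) ** 0.5
    return tuple(x / length for x in n) if length else (0.0, 0.0, 0.0)


def longest_side(a, b, c):
    """Length of the longest edge of triangle abc."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    return max(dist(a, b), dist(b, c), dist(c, a))


def fractalize(tri, rng, roughness=0.1, pixel=1.0, out=None):
    """Recursively split tri into four children until they are pixel-sized.

    Assumption: the displacement of each edge midpoint is drawn uniformly
    from [-roughness, roughness] times the longest side of the triangle.
    """
    out = [] if out is None else out
    a, b, c = tri
    size = longest_side(a, b, c)
    if size <= pixel:
        out.append(tri)          # stands in for "paint onto the screen"
        return out
    n = normal(a, b, c)

    def displaced(p, q):
        d = rng.uniform(-roughness, roughness) * size
        return tuple((p[i] + q[i]) / 2.0 + d * n[i] for i in range(3))

    ab, bc, ca = displaced(a, b), displaced(b, c), displaced(c, a)
    # Three corner triangles plus the central one joining the midpoints.
    for child in ((a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)):
        fractalize(child, rng, roughness, pixel, out)
    return out
```

Because each triangle here uses only its own normal and its own random numbers, this sketch shares the weakness of the basic procedure: neighbouring triangles are processed independently of one another.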


Anomalies in Fractal Surfaces 


This algorithm, although simple, has an
unfortunate property: it generates fractal
surfaces with anomalies. If the fractalization
process is carried out on two adjacent triangles
independently, there is no guarantee that the
fractal surfaces will meet along the common edge
that the triangles share. In fact, using this
technique, it is guaranteed that the surfaces will
not meet, since the displacements for each have
been made parallel to their respective normals.
This kind of anomaly may be seen in Figure 2.


* Or in a direction related to the normal - see
below.


Figure 2
Anomalies in triangular fractals


A solution to this problem is to require that
the displacements for the new mid-point of each
edge be in the direction of the average of the
normals of the two triangles which share that edge.
If, in addition, it can be guaranteed that the 
distance of the displacement will be the same for 
each of the two triangles, the fractal surfaces 
resulting from the two triangles will match, and no 
gaps or other anomalies will result. For an 
example of how this technique works, see Figure 3. 
The vector indicated from the midpoint of the 
common edge is in the direction of the average of 
the surface normals of the two triangles. 

These requirements, although sufficient to 
guarantee that fractal surfaces are free from 
anomalies, do pose difficult problems for the
implementation of a software fractal system. These 
conditions must be made to hold without imposing 


prohibitive memory or computational requirements. 
These difficulties are avoided in our fractal 
architecture by allowing the processors to 
communicate and to come to a mutual agreement 
concerning the size and direction of the 
displacements. 

Using the techniques described here, fractal 


surfaces may be generated for arbitrary surfaces 
defined using triangles. The surfaces are 
guaranteed to have no gaps or other anomalies. 
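

The matching condition described in this section can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and it assumes that both triangles already agree on the displacement distance and that their normals are not exactly opposite (in which case the average has no direction).

```python
def cross(u, v):
    """Cross product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def unit(v):
    """Scale v to unit length; a zero vector has no direction."""
    length = sum(x * x for x in v) ** 0.5
    if length == 0.0:
        raise ValueError("zero-length vector has no direction")
    return tuple(x / length for x in v)


def triangle_normal(a, b, c):
    """Unit normal of triangle abc."""
    u = tuple(q - p for p, q in zip(a, b))
    v = tuple(q - p for p, q in zip(a, c))
    return unit(cross(u, v))


def shared_midpoint(p, q, own_normal, neighbour_normal, distance):
    """Displace the midpoint of edge pq along the average of the two normals."""
    direction = unit(tuple(s + t for s, t in zip(own_normal, neighbour_normal)))
    return tuple((a + b) / 2.0 + distance * d
                 for a, b, d in zip(p, q, direction))


# Two triangles sharing the edge p-q, each seen from its own side.
p, q = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)
n_a = triangle_normal(p, q, (1.0, 1.0, 0.0))
n_b = triangle_normal(q, p, (1.0, -1.0, 1.0))
from_a = shared_midpoint(p, q, n_a, n_b, 0.25)
from_b = shared_midpoint(q, p, n_b, n_a, 0.25)
```

Since the sum of the normals and the midpoint of the edge do not depend on which triangle computes them, from_a and from_b are the same point, so the two fractal surfaces meet along the shared edge.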


Figure 3 
Subdivision of two triangles


An Architecture for Generating Fractal Surfaces
Introduction 


A study of the generation of fractal surfaces
will reveal that a great deal of the work may be
done in parallel if an appropriate multi-processor
architecture is used. The multi-processor
architecture we propose will gain its speed from
assigning one processor to each triangle to be
subdivided. Once a triangle has been split, the
processor responsible for that triangle activates
three new processors, assigns one of the generated
triangles to each of them, and processes the
remaining triangle itself. Each of the four
processors then subdivides its triangle in the same
manner used by the initial processor to subdivide
the first triangle. Thus the number of processors
active is equal to the number of triangles created
by the division process, and the number of
processors required to completely fractalize one
triangle in an object definition is equal to the
number of pixels that triangle will cover on the
screen.


Although the number of processors required is 


large, the individual processors are very simple. 
Using the increasing densities of VLSI 
technologies, it may be possible to place several 


hundred processors on a single custom chip. Thus a 
fractal generator of several thousand processors 
may be constructed using a small number of chips. 


Geometry of a Fractal Processor 


Triangles to be processed come in many shapes,
but for the purpose of subdividing them into 
smaller triangles their shape is immaterial. Since 
the fractalization process automatically stops 
subdivision when pixel-sized triangles are 
generated, the fact that a triangle is not 
equilateral will not be of concern. Only the 
topology of the triangles matters, and that is 
regular. Thus the geometry of a multi-processor 
fractal generator is regular. All processors 
reside at the intersections of a regular hexagonal 
grid, with communication paths along the grid and 
across certain diagonals of the grid.


Figure 4
Processor arrangement 


The arrangement of processors for the first 
three levels of the fractalization process is shown 
in Figure 4. The central processor, numbered 1, 
initiates the processing of the triangle. It 
subdivides the triangle, and spawns the three 
processors numbered 2. These processors and the 


original processor in turn subdivide their 


triangles and spawn the processors numbered 3. 


In addition to the parent-child communication 
paths shown by solid arcs, it is necessary for 
certain "cousins" to communicate to ensure that the 
triangles they generate will meet along their 
common edge. Thus, in addition, there must be the 
communication paths shown by dashed arcs. Although 
communication paths may form the diagonals of 
hexagons, it is never necessary for them to cross; 
the graph described by the processor 
interconnections is planar. 


If we examine the pattern of processors to a 
greater depth, we see a potential problem. For 
example, consider the pattern of the children of 
processor 1 to a depth of 4, as shown in Figure 5. 
Some processors at level 4 lie on the communication 
path from 1 to its level 2 children. If it was 
necessary for 1 to communicate with both a level 2 
child and a level 4 child at the same time, 
communication could become a serious problem, since 


these communication paths run in the same
directions.


Figure 5
Children of processor 1


If the inter-processor communication needs at
the various levels are examined, it is seen that this
case does not arise. If multiple communications
are required along a single ray, the paths are
completely disjoint. We will require that the
communication paths be between adjacent processors
only, and that messages which must travel a greater
distance, such as 1 to 2, be relayed by each
processor along the path.


Processor Operation 


Since the operations performed by all
processors are the same at any point in time, the
processor array within a single chip may be run
synchronously, with all processors running in
lock-step. If an application requires a processing
array of more than one chip, the speed of the
communication lines between chips may require that
the different chips run asynchronously.


We will consider the processors to be in either
of two phases of computation*: subdivision and
broadcast. During the subdivision phase, each
processor subdivides the triangle it is currently
working on. There is some interprocessor
communication involved since processors sharing an
edge must ensure that they pick the same point as
the new mid-point. This communication is only
between processors which have no active processors
between them.


* Three phases, if we include reading out the final
data; see below.


To ensure that no anomalies are generated, it
must be possible to guarantee that the two
processors subdividing adjacent triangles will
choose a displacement in the direction of the
average of the two normals. This may be
accomplished by having each compute the normal to
its triangle and send its computations to its
neighbor, and then having each processor average
the result of its own computation with its neighbor's.
At the end of this sequence, both processors have
computed the average of the normals.

If only single frames are to be generated, this 


exchange of normals is sufficient to remove all 
anomalies. If motion pictures are to be generated, 
however, it must be possible to guarantee that, in 


successive frames of the film, the direction and 
distance of the displacements will be the same. If 
not, the features introduced by the fractal process 
will change from frame to frame, and the surface 
will move as a result. This problem may be removed 
by having the seed of the random number generator 


be the coordinates in object space of the midpoint 
of the line to be divided. Thus the random numbers 
are independent of eye position. It can in this 


way be guaranteed that the same surface will result 
every time the object is processed. 
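

The idea of keying the random numbers on object-space coordinates can be sketched as below. This is only an illustration under stated assumptions: the function name edge_random is hypothetical, integer coordinates are assumed, and a SHA-256 hash is swapped in as a convenient stand-in for whatever generator or circuit an actual fractal processor would use.

```python
import hashlib
import struct


def edge_random(midpoint, scale=1.0):
    """Deterministic pseudo-random value in about [-scale, scale].

    midpoint is an (x, y, z) tuple of integers in object space. The value
    depends only on these coordinates, so it does not change with eye
    position or from frame to frame.
    """
    key = struct.pack("<3q", *midpoint)            # three signed 64-bit ints
    digest = hashlib.sha256(key).digest()
    raw = int.from_bytes(digest[:8], "big")        # 0 <= raw < 2**64
    return scale * (raw / 2 ** 63 - 1.0)


# The same edge midpoint gives the same displacement in every frame,
# and for both processors that share the edge.
first = edge_random((120, -45, 300))
second = edge_random((120, -45, 300))
```

A side effect of this design is that two processors sharing an edge obtain the same value without exchanging it, provided both compute bit-identical midpoint coordinates.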


The second phase of computation is the 
broadcast phase, when newly created triangles are 
sent to the processors which will subdivide them. 


This communication may be done by observing that 
the distance that a triangle definition must be 
sent to reach the processor which will be 


responsible for it is uniform at a given level. We
may then use a broadcast phase which consists of 
the appropriate number of "relay-triangle" pulses 
to all processors. If each processor relays the 
messages in the appropriate direction, the triangle 
definition will reach the correct processor when 
the last "relay" pulse arrives. The array then 
returns to the subdivision phase, during which the 
new allocation of triangles is broken up. 


This sequence of subdivision followed by 
reassignment continues until all of the created 
triangles are less than a pixel in size. This may 


be detected by having each processing element send 
to its parent a message when it finds that the
triangle it is to subdivide is pixel-sized. Once 


the parent has received such a message from each of 
its children, and its own triangle is pixel sized, 
it sends a similar message to its parent. Thus the 
central processor (level 1 processor) can be 
notified that no more subdivision is required, and 
that the data is ready to be extracted. At this 
point, the data describing the fractal triangles 
may be used to determine shading for the 
corresponding point on the screen. To accomplish 
this, some method must be available to extract the 
data describing the position and orientation of the 
fractal triangles from the processors of the 
fractal processor. 


A solution to the problem of extracting data
from the array is to notice that communication
paths already exist between the processing
elements. If the existing paths can be used, the
complexity of the processor can be substantially
reduced.


We will therefore add a data extraction phase 
to the processing. All data computed can be
extracted by simply reversing the direction of flow 
of the existing data paths. Details of this 
technique will be described later. 


In use, the host will load a single triangle 
definition into the central element (level 1) of 
the fractal processor. The host then issues a 
command to the array that processing should 
commence. The fractal processor then runs 
independently of the host until processing of that 
triangle is complete. The fractal processor then 
issues a signal to the host indicating that, and 
the data extraction commences. 


Performance 


Restrictions and Assumptions 


Since the goal of this work is to develop a 
processor architecture capable of generating 
fractal surfaces at high speed, we must pay 
considerable attention to the question of how fast 
this proposed architecture actually can run. 
Although exact numbers cannot be predicted without 


a great deal of experimentation, some estimates of 


performance can be made. 


Before making these estimates, we must first
make some assumptions which will simplify the task
of estimation. These concern whether the fractal
processor must be capable of performing floating
point arithmetic or only integer, and the number of
processing elements in the array and its
relationship to the size of triangles to be
processed.

The goals of the arithmetic system used are 
two-fold, and somewhat contradictory: speed and 


accuracy of computation. That speed is required is 


obvious if pictures are to be generated quickly. 
Accuracy is also required if errors induced by the 
number system are not to appear in the final 
pictures. 


Although floating point numbers have good 
accuracy and can represent a wide range of numbers 
simultaneously, they suffer from severe performance
problems. Because of this, floating point numbers 
are not appropriate for this system. 


Integer number systems and arithmetic units, on 
the other hand, are much faster and simpler than 
floating point units, and are thus much more suited 
to use in this system. An integer number system of 
24 to 32 bit numbers can handle the computations
required. 


If the triangles to be processed are too large, 
the fractal processor will be unable to completely 
decompose them before running out of processors. 
If this occurs, the fractal processor must stop the 
subdivision, unload all triangle definitions from 
all processing elements, and restart the array on 
those triangles, one-by-one. Not only does this
eliminate the parallelism of processing triangles
simultaneously, it also imposes the considerable
overhead of unloading the processor. If the
requirements for processors are examined, we find
that this can be guaranteed not to arise if the
triangles sent to the processor meet the
requirement that:

log(longest side of the triangle)
<= depth of the processor array

where the size of the triangle is measured in
pixels, and the log is taken base 2.


Prior to sending any triangle to the processing 
array, the host machine may apply this test and 
subdivide the original triangle into several 
smaller ones which meet this requirement. Since 
this is being done in the host, it may be done in 
parallel with the decomposition of other triangles 
by the fractal processor. Thus this requirement 
will not impose a serious burden upon the system. 
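

The host-side test can be sketched as follows. The function names are hypothetical, and the sketch assumes that each host subdivision roughly halves the longest side of a triangle, which holds for plain midpoint subdivision but only approximately once displacements are applied.

```python
import math


def fits_array(longest_side_pixels, depth):
    """The stated test: log2(longest side in pixels) <= depth of the array."""
    return math.log2(longest_side_pixels) <= depth


def host_splits_needed(longest_side_pixels, depth):
    """Number of host subdivisions before the pieces pass the test.

    Assumption: each subdivision halves the longest side, so a triangle
    split k times yields 4**k pieces for the fractal processor.
    """
    side = float(longest_side_pixels)
    splits = 0
    while not fits_array(side, depth):
        side /= 2.0
        splits += 1
    return splits


# A triangle 64 pixels on its longest side, sent to an array of depth 4,
# would be split twice by the host, giving 4**2 = 16 triangles.
pieces = 4 ** host_splits_needed(64, 4)
```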


Given this restriction that guarantees that the 
fractal processor will be capable of subdividing a 
triangle into pixel-sized triangles without 
interruption, the time required to completely 
subdivide one triangle from the object definition 
is bounded by 


depth * ( subdivision + broadcast ) + readout

where depth is the depth of the processor array,
subdivision is the time required for one triangle
subdivision, broadcast is the time for one triangle
relay, and readout is the time to extract the final
information from the processor. We now must
examine each of these time requirements in detail.


Subdivision Phase 


The subdivision of a triangle requires that
each processing element compute the normal to its 
triangle, the average of the normal of its triangle 
with the normal of adjacent triangles, the 
midpoints of each edge, and three random numbers. 


We will ignore the cost of the random numbers,
which should not be significant compared to the other
costs of subdividing the triangle. Since the
random numbers must be a function of the position
of the midpoint (to ensure repeatability, as
mentioned earlier), they may be computed by a
combinatorial circuit using the <x, y, z> object
space coordinates of the midpoint as inputs.
Although such numbers may not meet strict 
mathematical tests for randomness, they are 
sufficient for our purposes. 


The processor then exchanges the normal and 
random number information with each of its three 
neighbors. The processor averages its normal with 
the normals received from each of its neighbors, 


and averages the random numbers to determine the 
distance of displacement for each of the new 
midpoints. The new midpoints then are computed. 


Once this is accomplished, the processor only needs 
to send the computed information to the appropriate 
processors for further subdivision. 


Many of the operations required to subdivide a 
triangle may be performed in parallel. If this is 
done, and the hardware has the needed parallelism 
to perform computations on edges simultaneously, 
the number of serial operations required to 
subdivide a triangle is 6 multiplications, 22 
additions and subtractions, and 10 shifts. The 
time required to perform these operations depends 
heavily upon the technology used to implement them,
but reasonable estimates can be made based upon
commercially available micro-processors and other
devices, and are shown in Table 1.

         MOS            Bipolar         ECL
Add      1 micro-sec.   0.3 micro-sec.  0.1 micro-sec.
Shift    1 micro-sec.   0.3 micro-sec.  0.1 micro-sec.
Mult.    9 micro-sec.   4.5 micro-sec.  1.0 micro-sec.


Table 1 
Performance of Integrated Circuits 


Using these estimates, it can be seen that the 
task of subdividing a triangle can be performed in 
less than 86 micro-seconds. To be pessimistic in 
the computation of an order-of-magnitude figure, we 


will assume this step requires 100 micro-seconds 
using MOS technology (35 micro-seconds using 
Bipolar, or 10 micro-seconds using ECL). 
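

The arithmetic behind these figures can be checked with a few lines of Python. This sketch uses only the serial operation counts quoted above and the MOS and ECL columns of Table 1; the names are illustrative.

```python
# Serial operations needed to subdivide one triangle (from the text).
OPS = {"mult": 6, "add": 22, "shift": 10}

# Per-operation times in micro-seconds (MOS and ECL columns of Table 1).
MOS = {"add": 1.0, "shift": 1.0, "mult": 9.0}
ECL = {"add": 0.1, "shift": 0.1, "mult": 1.0}


def subdivision_time_us(costs):
    """Total time for one subdivision, in micro-seconds."""
    return sum(OPS[op] * costs[op] for op in OPS)


mos_time = subdivision_time_us(MOS)   # 6*9 + 22*1 + 10*1 = 86
ecl_time = subdivision_time_us(ECL)   # 6*1 + 22*0.1 + 10*0.1, about 9.2
```

The MOS total of 86 micro-seconds and the ECL total of about 9.2 micro-seconds are what the text rounds up to 100 and 10 micro-seconds respectively.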
Broadcast Phase 

The time required to broadcast the triangle
definitions to the processors which are to
subdivide them is not easily computed. The primary
difficulty is that the time required depends upon
the inter-node communication technique. If
processing nodes must communicate via a serial data
path, the time requirements will be substantial;
if the communication is via a parallel, or even a
mixed serial-parallel, path, the overhead of broadcast
may be greatly reduced. To get estimates of this
requirement, we will examine several possible
communication techniques. Only after we have some
estimates of the node-to-node communication times
can we examine the total time required to send
triangle definitions to their destinations, which
in general will not be the adjacent processing
element.


Serial Links. If serial links are used, all 
data defining the triangle to be subdivided must be 
sent in serial fashion over a single line between 
processing elements. Each triangle to be sent
requires the sending of three points, each of which
consists of three values. If the values are
defined by 32 bit integers, the transmission of a
single triangle will require 288 bits transmitted.
Although each processing element must send triangle 
definitions to three children, this can be done in 
parallel, if maximum hardware parallelism is used. 
If we assume that on-chip serial communication 
links can transmit information at a rate of 10M 
baud, the transmission of a triangle definition 
from one processor to the adjacent one will take 
28.8 micro-seconds. 


Mixed Serial-Parallel Links. Using this
technique, several lines run in parallel between
adjacent processing elements. The data to be
transmitted is sent in several stages, with pieces,
for example 8 bit wide sections, sent in parallel.
Although the same number of bits must be
transmitted, this technique will reduce the number
of transfer cycles from 288, with pure serial, to
36. Thus this method will reduce the node-to-node
transfer time by a factor of 8 to 3.6
micro-seconds. Since speed is critical, this may
be worth the extra communication lines involved.


Parallel Links. Due to the number of bits to
be transmitted, a pure parallel transmission system
is impractical. This would require 288 lines
between each pair of adjacent processing elements.
It may be possible, however, to have integer-wide
communication paths, capable of transferring 32 bits
in a single transfer cycle. This technique would
require only 9 cycles to transfer one triangle
definition, and thus would take only 0.9
micro-seconds, but would be extremely expensive in
terms of communication lines (and thus in chip 
area). 
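

The node-to-node transfer times for the three link types follow from one small calculation, sketched here. The function name is hypothetical, and the sketch assumes, as the figures in the text imply, that one transfer cycle takes 0.1 micro-seconds (10M transfers per second) whatever the width of the link.

```python
# One triangle: three points, three coordinates each, 32 bits per value.
BITS_PER_TRIANGLE = 3 * 3 * 32          # 288 bits

CYCLE_US = 0.1                          # assumed time per transfer cycle


def relay_time_us(link_width_bits):
    """Time to pass one triangle definition to an adjacent element."""
    # Ceiling division: a partly filled last cycle still costs a full cycle.
    cycles = -(-BITS_PER_TRIANGLE // link_width_bits)
    return cycles * CYCLE_US


serial = relay_time_us(1)      # 288 cycles
mixed = relay_time_us(8)       # 36 cycles
parallel = relay_time_us(32)   # 9 cycles
```

These reproduce the 28.8, 3.6 and 0.9 micro-second figures quoted for serial, mixed serial-parallel and integer-wide links.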


Multi-Chip Implementations. Although the mixed
serial-parallel scheme may be possible for
communication between processing elements on a
single chip, attempting to transfer data between
elements on separate chips using it would be
impossible. Such chip-to-chip transfers require
the transmission of data between many pairs of
processing elements (see Figure 6); the pin-outs
required would be tremendous. For that matter, the
mixed serial-parallel technique requires too many
lines for transmitting data between chips. Thus
the pure serial technique is probably the only one
feasible, and that at a substantially reduced
bandwidth.


Figure 6
Chip-to-chip Communication

Distance of Transmission 

The appropriate processor to subdivide a
triangle may be several nodes away along the
communication path. It is necessary for triangle
definitions to be relayed by elements along the
path from the generator of a triangle to the
processor which is to subdivide it. The time
required to broadcast triangle definitions to the
appropriate elements is the product:

number of relays * time per relay.

The number of relays required decreases as
processing proceeds since the processor which
should subdivide a triangle is closer to the
processor which generated it. For the purposes of
a coarse estimate, however, the depth of the
processing array is a bound for the number of
relays.


Extraction Phase 


The final crucial stage in generating a fractal 
surface is to extract the computed data from the 
fractal processor array. This may be accomplished 
by reversing the direction of data flow from that 
used during the computation stage of processing. 


Instead of triangle definition information
being sent from a level i processor to a level i+1
processor, the computed data is sent from level i+1
elements to level i processors. The level i node
then sends the data on to its parent, until the
triangle definition reaches the level 1 element,
which sends the data on to the host cpu for
display. The time required for the readout phase
is bounded by:

#elements in the array * time for one transfer.


As an estimate, we may assume that the time 
required for one transfer is the same as that 
required for a relay of a triangle definition. 


Although this technique is slow, it has the virtue 
of simplicity. All the needed communication paths 
exist, and the added complexity to each processor 
is slight. 


Overall Performance 


Using the estimates developed in the previous 
several sections, and noting that the number of 
elements in an array is 4**(depth - 1), the time 
required to generate and extract data for a single 
triangle from the definition is: 


depth * (subdivision + depth * relay) + 
4**(depth - 1) * relay. 


Using our estimate of the time to relay a 
triangle definition using the mixed serial-parallel 
interface, the total time required to process one 
triangle from the object definition to pixel sized 


triangles may be computed. Some of these times are 
summarized in Table 2 for processing arrays of 
various sizes and for various implementation 
technologies. It should be noted that a speed-up 


by a factor of three has been assumed for relay 
times for Bipolar, and a factor of three speed-up 
over that for ECL. 


Depth =             4              6              8
Pixels Covered =    64             1024           16,384
MOS                 0.688 msec.    4.416 msec.    60.012 msec.
Bipolar             0.236 msec.    1.482 msec.    20.017 msec.
ECL                 0.072 msec.    0.484 msec.    6.659 msec.


Table 2 
Object Triangle Fractalization Times 
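

The entries of Table 2 can be recomputed from the overall formula, as in the sketch below. The names are illustrative. The inputs are the pessimistic subdivision times quoted earlier (100, 35 and 10 micro-seconds) and the mixed serial-parallel relay time of 3.6 micro-seconds, divided by three for Bipolar and by three again for ECL as stated above.

```python
def fractalization_time_us(depth, subdivision_us, relay_us):
    """depth * (subdivision + depth * relay) + 4**(depth - 1) * relay."""
    elements = 4 ** (depth - 1)          # processing elements in the array
    compute = depth * (subdivision_us + depth * relay_us)
    readout = elements * relay_us
    return compute + readout


# (subdivision time, relay time) in micro-seconds for each technology.
TECHNOLOGY = {
    "MOS": (100.0, 3.6),
    "Bipolar": (35.0, 3.6 / 3),
    "ECL": (10.0, 3.6 / 9),
}


def table2(depths=(4, 6, 8)):
    """Times in milliseconds, indexed by technology and then by depth."""
    return {
        name: {d: fractalization_time_us(d, sub, relay) / 1000.0
               for d in depths}
        for name, (sub, relay) in TECHNOLOGY.items()
    }
```

For MOS at depth 4 this gives 4 * (100 + 4 * 3.6) + 64 * 3.6 = 688 micro-seconds, or 0.688 msec., and the other values agree with Table 2 to within the last printed digit. The readout term, 4**(depth - 1) * relay, accounts for almost all of the time at depth 8.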


It may be noted that, for increased depths, the 
time required for transfer of data out of the array 
becomes the primary delay. Although this may be 
reduced by using a more complex data extraction 
technique, the limiting factor for the speed at 
which the data may be extracted is the speed of the 


host, not of the fractal processor. The fractal 
processor is capable of supplying data at rates 
exceeding that at which the host can compute 


shading and place the data into the frame-buffer. 


Conclusions 


Great interest is being shown in generating 
images of terrain using fractal surfaces. Such 
images are often indistinguishable from natural 
terrain, if seen at comparable resolution. This 
technique shows great promise for the generation of 
realistic scenes of great complexity. 


Fractal surfaces have several advantages over 
other techniques for modeling in computer graphics. 
Object definitions may be very simple and contain 
little detail; all detail in the final image is 
generated by the fractal process. In addition, 
images may be generated to any level of resolution 
desired; no limit is imposed by the object 
definition. 


Unfortunately, fractal algorithms suffer from 
the major drawback of being computationally very 
expensive. Until this barrier is overcome, the use 
of fractal images will be limited to still images, 
or to motion pictures generated only by those with 
the large computing resources needed. Their use in 
real-time applications, such as simulators, will be 
impossible. 


Much of this problem with fractal surfaces 
stems from the fact that the computer must generate 
the pixel-sized triangles of the image one at a 
time. If the technique of generating them is 
examined, however, it may be seen that much of the 
work may be done in parallel, if the appropriate 
multi-processor architecture is used. 


This paper proposes an architecture designed 
for the generation of fractal surfaces. The 
processor array consists of a large number, perhaps 
several thousand, processing elements, each of 
which is designed to subdivide one triangle. These 
processing elements are connected via a hexagonal 
grid, with communication lines along the lines of 
the grid and across certain diagonals. 


This grid architecture forms a planar graph, so
its implementation in VLSI is easy. With such an
implementation, several hundred, perhaps a
thousand, elements could be placed on one chip.
Thus a complete fractal processor could be
implemented in a handful of chips.

The advantages of this arrangement are many. 
This architecture would make possible the 


generation of fractal images of moderate complexity 


in a few seconds. This is at least 2 orders of 
magnitude faster than current software-based 
systems. 


This architecture is simple. Each node in the
processing array is functionally identical to every
other node. The only differences lie in their
communication links to other nodes. Moreover, the
nodes only need to be able to perform a simple
function - the subdivision of a single triangle.


The architecture is easily expandable. The 
fractal processor may be increased in power and 
performance by replicating the basic unit four 
times, and placing appropriate communication links 
between the edges of the units. The fractal 
processor so obtained appears to the host computer 
to be identical to the original, smaller unit, 
except that it can process larger triangles.


The architecture described here makes possible 
the generation of fractal surfaces with reduced 
computational expense, and will thus make fractals 


much more useful in a range of applications from
motion pictures to high-performance aircraft
training simulators.


Abstract


A special purpose multiprocessor architecture 
has been developed which facilitates the high 
speed display and manipulation of shaded three 
dimensional objects or object surfaces on a 
conventional raster scan CRT. The reconstruction 
algorithms exploit the capability to divide object 
space into totally disjoint cubical regions 
permitting multiple display processors to access 
independent memory banks concurrently without 
conflict. All of the geometric parameters 
describing rotation, translation, and scaling are 
incorporated into one short table facilitating 
very rapid manipulation of the image display 
presentation. 


Introduction 


The desire to generate realistic 
presentations of three dimensional objects has 
stimulated a great deal of interest in special 
purpose hardware and software systems. 
Applications for such technology include 
industrial simulation, three dimensional 
modelling, and medical imaging for clinical 
diagnosis. Currently, systems that operate in 
real time (i.e., 15-30 frames/second or more) fall 
primarily into two classes: The first are based 
on random scan display generation and thus provide 
only wireframe images with little realism. Other 
techniques based on polygons or other geometric 
primitives are being developed for use in such 
systems as high performance aircraft simulators. 
Systems of this type are not entirely suitable for 
images derived from experimental data but are 
being increasingly utilized for synthesized 
computer graphics. 


Software based systems which generate 
realistic images of natural structures are 
extremely slow. These include the DISPLAY 
software package developed by Herman et al [1]-[4] 
for medical applications; other systems have been 
proposed independently [5]-[7]. The Lexidata 
SOLIDVIEW system [8] is a commercial product which 
combines hardware and firmware to offload hidden 
surface removal and shading algorithms from user 
software. Even with such hardware assist, a 
single view may take several minutes to be 
generated. 


One important application of such a system is
in the area of medical image processing using CAT,
PET, or NMR scanning and reconstruction 
techniques. A system permitting a physician to 
visualize and interact with a shaded 3-D 
representation of an organ would greatly 
facilitate the examination of anatomical 
structures in conjunction with medical research, 
clinical diagnosis, and the planning of surgical 
procedures. 


In this paper we present one possible 
architecture which will permit the high speed 
display and manipulation of solid objects 
represented as a voxel (volume element) database 
with grayscale. Our objective is to provide 
certain capabilities at or near video rates 
facilitating extensive real-time interaction. The 
architecture is highly modular permitting a cost 
tradeoff to be made to achieve a given level of 
performance. It also includes a great deal of 
regularity in its structure making it directly 
suitable for VLSI implementation. A key feature 
is that no computational operations more complex 
than adds, shifts, and comparisons are required in 
real time. 


The DISPLAY Algorithm 


The overall display processor architecture is 
based on the DISPLAY software package described in 
[1] and utilizes modified versions of those 
algorithms. The modified software generates 
surface views by mapping 3-D object space into 2-D 
image space using either a Z-buffer or equivalent 
time-ordered display procedure for hidden surface 
removal. In the latter version, for a given 
orientation of the object, pixels are written into 
the 2-D image (display) buffer in time-order 
corresponding to reading out voxels from the back 
to the front of the object. This "painter’s 
algorithm" guarantees that any point that should 
be obscured by something in front of it will in 
fact be invisible in the final reconstruction. 
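

The back-to-front readout can be sketched in a few lines of Python for the simplest case, a view along the Z axis. This is only an illustration under stated assumptions: the function name render_back_to_front is hypothetical, a density of 0 is taken to mean "empty", and no rotation is applied.

```python
def render_back_to_front(volume, viewer_at_low_z=True):
    """Project a 3-D density array onto the X-Y plane by overwriting.

    volume[z][y][x] holds a density; 0 is treated as empty (an assumption).
    Slices are read farthest-first, so nearer voxels overwrite farther ones.
    """
    depth = len(volume)
    height = len(volume[0])
    width = len(volume[0][0])
    image = [[0] * width for _ in range(height)]
    # Farthest slice first: high z if the viewer sits at low z, else low z.
    z_order = range(depth - 1, -1, -1) if viewer_at_low_z else range(depth)
    for z in z_order:
        for y in range(height):
            for x in range(width):
                density = volume[z][y][x]
                if density != 0:
                    image[y][x] = density  # later (nearer) writes win
    return image


# A 2x2x2 scene: two voxels share the same (x, y) but differ in z.
volume = [
    [[5, 0], [0, 0]],  # z = 0
    [[9, 7], [0, 0]],  # z = 1
]
front = render_back_to_front(volume, viewer_at_low_z=True)
back = render_back_to_front(volume, viewer_at_low_z=False)
```

No depth comparison is made anywhere in the loop; visibility follows from the write order alone, which is the property the time-ordered procedure relies on.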


Figure 1 illustrates a simple two dimensional 
analog of the back-to-front readout technique. It 
can be seen that for any orientation (rotation) of 
the object, there exists a readout sequence (and 
hence a processing time sequence) such that voxels 
early in the sequence (which should be hidden) 
will be overwritten by voxels later in the 
sequence. This architecture addresses the 
recursive decomposition of the sequence in such a 
way that (1) a near real-time update rate is 
possible, (2) common geometric modifications are 
instantaneous, and (3) a modular structure is 
created. 


In order to reduce the problem of real-time 
display of 3-D objects to manageable proportions, 
it is necessary to partition either the input 
(object) space or output (image) space - or some 
combination of these. In a multiprocessor 
implementation, partitioning input space and 
assigning each partition to a separate processor 
will avoid object memory access conflicts, whereas 
partitioning output space will avoid image memory 
access conflicts. The former technique is clearly 
superior and will minimize conflict since a 
substantial amount of data reduction occurs in the 
projection from 3-D to 2-D space. 


Representation of 3-D Objects 


We assume the input to the system (object 
space) is a 3-D scene subdivided by three sets of 
parallel planes into cube shaped volume elements 
or voxels as shown in Figure 2. While the 
algorithms which will be described can be 
generalized to any uniform parallelepipeds, 
throughout this paper we will generally assume a 
cubical object space. Furthermore, note that the 
voxel dissection is a special case of a general 
representation based on convex polyhedrons [9]. 


Associated with each voxel is a numeric 
quantity called the density which may correspond 
to color or brightness, or some other point 
property of the object. One use of density is to 
distinguish between different and distinct objects 
or parts of objects making up the input scene. A 
natural data structure for such a scene is a 3-D 
array, indexed by X, Y, and Z where the value of 
each element is the density of the corresponding 
voxel. A binary (two level) object would require 
one bit per point. Typically, however, the 
storage format is one byte per point supporting up 
to 256 density levels. 


Such a 3-D array is spatially presorted in 
the sense that for any viewpoint, voxels can be 
read out and displayed in a sequence which 
guarantees that voxels retrieved early in the 
sequence cannot obscure voxels retrieved later in 
the sequence [10]. This property leads to the 
simple method of hidden surface removal described 
above. However, no data compression is achieved. 


Two other possible data structures that can 
be used are octrees [7] and unsorted lists of 
voxels. The octree representation has the same 
spatially presorted property as the 3-D array. 
Octrees achieve excellent data compression when 
large regions of the scene contain the same 
density as, for example, in a binary scene. 
However, in our experience, the advantages for 
real world (particularly medical) objects are more 
than offset by the computational overhead 
associated with traversing the tree structure. 


The second method is to store the voxels in 
random order, using 4 locations for each voxel (X, 
Y, Z, and density). This method is advantageous 
when a single small object has already been 
separated from the surroundings by means of 
thresholding or segmentation. However, a true 
Z-buffer is required for hidden surface removal. 
We have not found it appropriate for a real-time 
system capable of displaying entire scenes. 


Display Processor Organization 


The basic hardware realization of the DISPLAY 
algorithm consists of five components as 
illustrated by the block diagram in Figure 3: 


Figure 3 - Overall Display System Architecture (block diagram: 64 Processing Elements, 8 Intermediate Processors w/Buffers, Output Processor w/Buffers, Raster Scan Display) 


* Display processor array (64 PEs), each with 
associated object subcube memory module, and 
double 128x128 image buffer. 


* Intermediate processors (for groups of 8 PEs) 
feeding double 256x256 intermediate image 
buffers. 


* Output processor (for group of 8 intermediate 
buffers) feeding double 512x512 image buffer. 


* Video postprocessor and video interface. 


* Host computer interface and microprocessor 
based system controller. 


Briefly, the processing strategy consists of 
the following. The processing elements (PEs) 
compute the 2-D subimages from each 64-subcube 
(64x64x64 voxels) of the overall 256-cube input 
object. Each PE contains a double buffer, each 
half of which is sufficient to hold the largest 
image that can be created from its associated 
64x64x64 cube. 


The reconstructed image will consist of two 
components. The first of these is the density of 
each active point in the object - those which have 
not been removed through thresholding, for 
example. Depth or Z coordinate - the distance 
from the point to the front end of object space - 
is buffered also for use by the shading 
postprocessor. 


Each of the eight intermediate processors 
merges the 2-D subimages generated by its set of 8 
PEs into the appropriate position in the eight 
intermediate double buffers following priority 
rules determined by the sequence control table 
(see below). Finally, the contents of the 
intermediate buffers are merged into the double 
512x512 frame buffer, following the same priority 
rules. The two halves of the double frame buffer 
are filled alternately - one is computed while the 
other is displayed. Postprocessing consists of a 
global tone scale lookup table, shading algorithm 
implementation, and final brightness and/or 
pseudo-color lookup table. 


A high speed interface permits communications 
with a host computer system for the purpose of 
image loading and readback. The host will also be 
responsible for archiving and retrieving 
appropriate data files, and converting formats to 
the internal object representation. The system 
controller is responsible for coordinating the 
activities of the 64 PEs by generating the 
sequence control tables for each desired object 
orientation. The control table includes X, Y, and 
Z position offsets for each of the subcubes making 
up object space. This information is used to 
recursively compute the absolute offsets for every 
point in the output image as well as the order of 
processing of voxels for the hidden surface 
removal algorithm. 


Object Memory System 


To display any set of objects with a scale 
factor of 1:1 within a 256-cube object space 
requires 16 M Bytes of high speed RAM (assuming 8 
bit quantization for each point). While this may 
seem to be an extremely large amount of high speed 
memory, it should be recognized that the steep 
decline in MOS memory prices is expected to 
continue for some time. In addition, even at 
current prices, the cost of the overall display 
device (which is dominated by the cost of this 
memory) should be relatively small compared to the 
cost of a complete medical imaging system such as 
a CAT scanner. 


Since the object space is divided into 64 
equal subcubes, each PE requires 256 K Bytes of 
associated memory. Suitable memory management 
hardware located between the host and the 
PE-memory system facilitates direct computer 
accesses to restricted regions of object space 
such as X, Y, or Z planes or variable size cubical 
areas. 
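

The memory figures quoted above follow directly from the geometry. The short Python check below reproduces them; the constant names are illustrative only, and the 4x4x4 arrangement of subcubes is inferred from 64 equal subcubes of a 256-cube.

```python
CUBE_EDGE = 256         # voxels along each axis of object space
SUBCUBES_PER_AXIS = 4   # 4 x 4 x 4 = 64 subcubes, one per PE
BYTES_PER_VOXEL = 1     # 8-bit density quantization

pe_count = SUBCUBES_PER_AXIS ** 3
subcube_edge = CUBE_EDGE // SUBCUBES_PER_AXIS
total_bytes = CUBE_EDGE ** 3 * BYTES_PER_VOXEL
per_pe_bytes = subcube_edge ** 3 * BYTES_PER_VOXEL

total_mbytes = total_bytes / 2 ** 20   # 16.0
per_pe_kbytes = per_pe_bytes / 2 ** 10  # 256.0
```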


In each memory module, data are organized 
into groups of eight voxels occupying a pair of 32 
bit words. Each such group constitutes a 2-cube. 
Five bits are required to specify a 2-cube index 
for each coordinate axis. Each memory access 
retrieves a word pair which is buffered in a 
register between the memory and the processing 
element permitting an entire 2-cube to be 
traversed in any order. The sequence control 
table is used to generate addresses for the memory 
access controller. 
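

One way to picture the 2-cube organization is as an address split: the upper five bits of each coordinate select the 2-cube (one word pair), and the lowest bit of each coordinate selects one of the eight voxels inside it. The Python sketch below assumes a particular bit ordering (Z, then Y, then X) purely for illustration; the text does not specify the ordering, and the function name is hypothetical.

```python
def two_cube_address(x, y, z):
    """Map a voxel in a 64-subcube to (word-pair index, byte within pair).

    The Z/Y/X packing order is an assumption made for this sketch.
    """
    for coord in (x, y, z):
        if not 0 <= coord < 64:
            raise ValueError("coordinates must lie in 0..63")
    ix, iy, iz = x >> 1, y >> 1, z >> 1           # 5-bit 2-cube indices
    pair_index = (iz << 10) | (iy << 5) | ix      # 15 bits: 32768 word pairs
    byte_in_pair = ((z & 1) << 2) | ((y & 1) << 1) | (x & 1)  # 0..7
    return pair_index, byte_in_pair


WORD_PAIRS = 32 ** 3           # number of 2-cubes in a 64-subcube
MODULE_BYTES = WORD_PAIRS * 8  # two 32-bit words per 2-cube
```

With this split, a single memory access fetches all eight voxels of a 2-cube, and the three low bits can then be stepped through in whatever order the current orientation requires.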


Display Processing Elements 


Each processing element consists of a 
pipelined arithmetic processor, input density 
lookup table, its copy of the sequence control 
table, and a dual 128x128 image buffer memory. 
Figure 4 illustrates the overall organization of 
the PE and its associated 64-subcube input object 
memory module. 


Figure 4 - Processing Element (PE) Organization 


The density lookup table is used to 
preprocess the voxels retrieved from memory for 
various purposes including selective masking, 
thresholding, or image enhancement based on 
density value. 
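

Thresholding through a lookup table can be illustrated as follows. This is a minimal Python sketch assuming 8-bit densities and the convention that a table output of 0 marks a voxel as inactive; both helper names are hypothetical.

```python
def make_threshold_lut(low, high):
    """256-entry table: densities in [low, high] pass through, others map to 0."""
    return [d if low <= d <= high else 0 for d in range(256)]


def apply_lut(lut, voxels):
    """One table read per voxel; no arithmetic is needed at display time."""
    return [lut[v] for v in voxels]


lut = make_threshold_lut(100, 200)
out = apply_lut(lut, [0, 99, 100, 150, 200, 201, 255])
```

Changing the threshold only requires reloading the 256 table entries, not touching the object memory.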


The arithmetic processor is responsible for 
computing X, Y, and Z offsets for each pixel of 
the image based upon the position of the 
corresponding input voxel. This is accomplished 
with no multiplies, divides, or other time 
consuming arithmetic or logical operations. 
As can be seen in Figure 5, the most complex 
operation is arithmetic addition. To obtain 
successively finer position offsets requires 
shifts but these are performed within the wiring 
of the pipelined system. Only the data paths for 
position computation along one coordinate axis are 
shown - the other two are similar. 


Figure 5 - Major Arithmetic Processor Data Paths (figure label: Sequence Control Table Data) 


The operation of the arithmetic processor is 
based on the time-ordered display algorithm used 
for hidden surface removal. The fundamental 
concept which simplifies the hardware 
implementation is that regardless of object 
orientation, each subcube is entirely independent 
and all 64 subcubes may be processed in parallel 
since for any given subcube, every other subcube 
is either entirely in front of it or entirely 
behind it. The same characteristic also permits 
the overall computation of X and Y positions to be 
accomplished recursively, starting with the 
largest subcubes and working down to individual 
voxels, dividing by 2 at each step. For any 
particular voxel, the position offset along a 
given axis (X, Y, or Z) can be computed by simply 
adding the appropriately shifted control table 
entries. Thus, for a single orientation, only one 
control table of position offsets is sufficient 
for computation of all X, Y, and Z positions. 
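

The shift-and-add computation can be checked numerically. The Python sketch below builds an eight-entry table of rotated octant-center offsets for a 256-cube and recovers a voxel position by summing table entries divided by successive powers of two (the software equivalent of the wired shifts). Floating point is used for clarity; the rotation parameterization, the octant numbering, and all function names are assumptions of this sketch rather than details taken from the hardware design.

```python
import math

LEVELS = 8  # 256 = 2**8 voxels per axis


def rotation_matrix(yaw, pitch):
    """Rotation about Y by yaw, then about X by pitch (radians)."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    return [[sum(rx[i][k] * ry[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transform(m, v):
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))


def build_table(m):
    """Rotated offsets of the 8 octant centers (+/-64 along each axis)."""
    table = []
    for octant in range(8):
        signs = [1 if (octant >> axis) & 1 else -1 for axis in range(3)]
        table.append(transform(m, [64.0 * s for s in signs]))
    return table


def voxel_position(table, x, y, z):
    """Sum one table entry per subdivision level, halved at each level."""
    pos = [0.0, 0.0, 0.0]
    for level in range(LEVELS):
        bit = LEVELS - 1 - level  # most significant coordinate bit first
        octant = (((x >> bit) & 1)
                  | (((y >> bit) & 1) << 1)
                  | (((z >> bit) & 1) << 2))
        for axis in range(3):
            pos[axis] += table[octant][axis] / (1 << level)
    return tuple(pos)


m = rotation_matrix(0.3, -0.7)
table = build_table(m)
shift_add = voxel_position(table, 17, 200, 93)
direct = transform(m, [17 - 127.5, 200 - 127.5, 93 - 127.5])
```

The two results agree because the offset of a voxel from the center of object space is the sum of half-edge steps of 64, 32, ..., 0.5 along each axis, and rotation is linear.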


The precise X and Y destination coordinates 
in the 128x128 buffer are computed and converted 
to a memory address where the video (density and Z 
value) will be stored. A double buffer enables 
computation to proceed while the alternate buffer 
is being merged in subsequent stages of 
processing. 


Anti-Aliasing 


For magnifications from object space to image 
space greater than 1.732:1 (1/√3), holes would 
appear in the output image at certain orientations 
if anti-aliasing techniques were not utilized. 

Two methods have been investigated thus far: 
display of the centers of the visible faces of 
each voxel (1-cube) and double resolution 
interpolation with resampling. The first of these 
represents the singular case of the DISPLAY 
algorithm where the object cube faces have a size 
of exactly one pixel. At most, there will be 
three faces visible from any orientation. 
Interpolating out to a double resolution 3-D grid 
and resampling is similar to anti-aliasing 
techniques used in graphics display processors. 
Both of these require more sophisticated 
processing and additional buffer memory in each of 
the PEs, but can be accomplished within the 
pipelining time constraints. 


Sequence Control Table (SCT) 


The SCT contains 8 entries sorted in the 
required time-order defining the X, Y, and Z 
offsets of the centers of the 8 largest subcubes 
with respect to the center of object space. 
Offsets for successively smaller subcubes are 
determined by shifting the table entries by an 
appropriate amount (between 0 and 7 places to the 
right). Adequate precision must be maintained in 
the table to achieve consistency of the offsets 
and prevent objectionable boundary errors from 
appearing in the final image. In addition to 3-D 
rotation, many other interactive capabilities can 
be implemented through modifications of the 
entries in the sequence control table and simple 
additions to the display processor hardware. A 
few of these are described below. 


General anamorphic scaling is accomplished by 
simply multiplying the X, Y, and Z values stored 
in the table by the appropriate scaling factors 
with suitable interpolation of the input density 
data. Translation in 3-space is easily supported 
by adding X, Y, and Z offsets to the addresses of 
the output image buffer. 


The display of up to 64 independently 
configurable objects can be achieved by loading 
object specific SCTs into each of the individual 
PEs or selected groups of PEs and modifying the 
implementation of the merge algorithms. This 
would permit complete control for objects within 
their own subcubes. These "sub object spaces" 
could include any 3-D rectangular region 
comprising multiples of the basic 64-cube. Other 
display parameters can be associated with the 
individual PEs including a translation offset and 
the tone scale mapping to be used for the input 
data. 
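

The remark on table precision can be made concrete with a small fixed-point experiment. The Python sketch below quantizes one axis of a sequence control table to a chosen number of fractional bits, forms an offset by adding entries shifted right by 0 to 7 places, and compares the result with exact arithmetic. The table values and the octant path are made-up illustrative numbers, and the function names are hypothetical.

```python
def fixed_point_offset(entries, octants, frac_bits):
    """Add table entries shifted right by 0..7 places, in integer arithmetic.

    entries : 8 real-valued offsets along one output axis (one per octant)
    octants : octant index selected at each of the 8 subdivision levels
    """
    scale = 1 << frac_bits
    table = [round(e * scale) for e in entries]  # quantized table entries
    total = 0
    for level, octant in enumerate(octants):
        total += table[octant] >> level          # arithmetic right shift
    return total / scale


def exact_offset(entries, octants):
    return sum(entries[o] / (1 << level) for level, o in enumerate(octants))


# Illustrative values only (not taken from the paper).
entries = [-83.1, -40.7, -12.3, 30.1, -30.1, 12.3, 40.7, 83.1]
octants = [7, 0, 3, 3, 5, 1, 6, 2]
errors = {bits: abs(fixed_point_offset(entries, octants, bits)
                    - exact_offset(entries, octants))
          for bits in (0, 4, 8, 12)}
```

Each shifted term loses less than one unit in the last place, so the accumulated error stays below 8 units in the last place; carrying a few fractional bits in the table therefore keeps the error well under one pixel.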


Intermediate Processors and Buffers 


Each of the images produced by the display 
processing elements consists of a two dimensional 
array of 112x112 points (in a 16384 word - 128x128 
point memory - 1:1 scale factor) corresponding to 
the largest possible 2-D projection of the 
64-subcube. These 64 images must eventually be 
merged into the output 512x512 frame buffer. This 
is accomplished in two steps. First the 64 images 
are combined 8-fold into the 256x256 point 
intermediate buffers, and then these are combined 
again 8-fold into the final output buffer. The 
first step is performed in parallel by the eight 
intermediate processors. Each of these merging 
processes requires the computation of position 
offsets as described for the individual PEs, 
above. Double buffers permit the merging and 
readout operations to be taking place 
concurrently. As described above, the same SCT 
determines both the order of computation within a 
PE and the order of combining for the merge 
operations. Figure 6 shows the second stage merge 
operation. The first stage data flow is similar. 


Figure 6 - Second Stage Merging - Intermediate Buffer to Output Buffer 
(PGs represent processor groups, each of which includes 8 PEs and its 
associated Intermediate Processor and Buffer.) 
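

A merge stage of this kind can be sketched in Python as an ordered overwrite. The sketch assumes that the subimages arrive already sorted back to front (the order given by the SCT), that each carries an (x, y) offset into the larger buffer, and that a density of 0 marks an inactive pixel; the function name and data layout are illustrative only.

```python
def merge_subimages(size, subimages):
    """Merge subimages into a size x size buffer.

    subimages: list of (image, x_offset, y_offset) in back-to-front order,
    so that later entries overwrite earlier ones where they are active.
    """
    buffer = [[0] * size for _ in range(size)]
    for image, x_off, y_off in subimages:
        for row, line in enumerate(image):
            for col, density in enumerate(line):
                if density != 0:  # inactive pixels never overwrite
                    buffer[y_off + row][x_off + col] = density
    return buffer


far = [[1, 1], [1, 1]]
near = [[2, 0], [0, 2]]
merged = merge_subimages(3, [(far, 0, 0), (near, 1, 1)])
```

The same routine serves for both stages; only the buffer size and the number of inputs change.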


Output Frame Buffer 


The final buffer stores the output image and 
Z depth values for use by the shading hardware. 
The major function of this memory is to permit 
scan conversion to standard video format for 
display on a monochrome or color raster scan TV 
monitor. This memory is directly accessible by 
the host. Since the output buffer is larger than 
the size of any projection of the 256-cube object 
(assuming no scaling), space is available to 
display other pictorial data or text. Any 
displayed image may be read back to the host for 
archiving or further processing. 


Computation Pipeline Timing 


Using the architecture outlined above, we can 
calculate the expected performance and throughput 
of the system. We assume that the required 
processing time is 100 ns per primitive 
calculation. This represents a conservative 
design guideline for discrete TTL or high 
performance NMOS VLSI technology. 


* The time required to generate a subimage (by 
the PE) from the 64-subcube is 256 K x 100 ns 
or ~25.6 ms. 


* The time required to merge groups of 8 subimage 
buffers into a 256x256 intermediate buffer is 8 
x 112 x 112 x 100 ns or ~10 ms. 


* The time required to merge 8 intermediate 
buffers into the output buffer is 8 x 224 x 224 
x 100 ns or ~40 ms. 


Thus, the limiting time is the last - 
corresponding to a frame update rate of 1000/40 or 
approximately 25 frames/second. Note that because 
of the pipeline latency, however, a response to a 
change in orientation will require a total of 
three frame times to become visible. 
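

The stage times above can be reproduced with a few lines of Python. The variable names are illustrative. Taking 256 K as 2^18 gives about 26.2 ms for the PE stage; the ~25.6 ms quoted above corresponds to reading K as 1000.

```python
CYCLE_NS = 100  # assumed time per primitive calculation

pe_ms = 64 ** 3 * CYCLE_NS / 1e6                   # one pass over a 64-subcube
intermediate_ms = 8 * 112 * 112 * CYCLE_NS / 1e6   # merge 8 PE subimages
output_ms = 8 * 224 * 224 * CYCLE_NS / 1e6         # merge 8 intermediate buffers

bottleneck_ms = max(pe_ms, intermediate_ms, output_ms)
frames_per_second = 1000 / bottleneck_ms
latency_ms = 3 * bottleneck_ms  # three pipeline stages, each one frame time
```

Because the stages are double buffered, throughput is set by the slowest stage alone, while the response latency is the sum of three frame times, roughly 120 ms.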


Assuming output to a standard NTSC compatible 
video monitor, the full 25 frame per second 
throughput rate can be exploited by switching 
buffers whenever a new frame has been completely 
loaded. Alternately, a dual port memory system 
may be used for the output buffer. However, 
visible image changes (breaks) may occur for fast 
changing objects during the frame display. An 
effective update rate of 20 frames per second can 
be easily achieved by displaying each frame 
generated by the display system 1-1/2 times 
corresponding to 3 video fields using 2:1 
interlaced scanning. 


Postprocessing 


Two types of postprocessing are to be 
implemented in real time: tone scale lookup 
tables for the video intensity and other display 
parameters, and some form of shading to enhance 
the appearance and realism of the image. 


Tone scale transformation hardware will 
permit the class of point type image processing 
functions which are traditionally used with image 
processing systems to be implemented on the output 
image in real time. Examples of these operations 
include contrast enhancement, interactive 
thresholding, and pseudo-color processing. 


Figure 7 - Examples of Software Simulation using Depth-only Shading 


Realistic shading of the output image is 
essential to provide depth cues and other visual 
information about object structure. We have 
conducted software simulations using 
"distance-only" shading, "constant" shading, 
"contextual" shading [3] and Phong shading [11]. 
In distance-only shading, the intensity of a point 
of the image is determined by the distance of the 
corresponding point of the object from the light 
source. This is simple to compute and gives 
pleasing results (Figure 7). The other shading 
models take direction into account by computing 
the inner product of the normal to the surface 
with a unit vector along the light ray reaching 
it: this provides curvature information. In 
constant shading, the normal assigned to a point 
is independent of its neighbors, whereas in 
contextual shading the overall configuration is 
taken into account. Phong shading, which is an 
extension of Gouraud shading [12], is an attempt 
to make the intensity vary smoothly from point to 
point. 


The distance-only shading algorithm simply 
uses the Z coordinate (depth) to obtain the 
brightness of each output point. Most other 
shading schemes are more difficult to implement 
since they are non-local operations requiring 
knowledge of neighboring voxels. One solution 
would use a gradient operator on the Z coordinates 
to obtain the surface normal at each point. 
Alternatively, local direction information can be 
stored in each voxel (along with the density) and 
passed to the shading postprocessor. This 
approach has been used effectively for the DISPLAY 
software package. A hardware implementation of 
directional shading would, however, necessitate 
expanded object and buffer memory capacity and 
wider data paths in addition to the shading 
postprocessor. 
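

The two Z-buffer based options can be compared in a short Python sketch. It assumes a light source at the viewer, a linear falloff of brightness with depth, None as the background marker, and central differences as the gradient operator; none of these choices is prescribed by the text, and the function names are hypothetical.

```python
import math


def depth_shade(z_buffer, z_max, levels=255):
    """Distance-only shading: nearer points are brighter (linear falloff)."""
    return [[0 if z is None else levels * (z_max - z) // z_max for z in row]
            for row in z_buffer]


def gradient_shade(z_buffer, x, y):
    """Cosine between the estimated surface normal and a head-on light.

    The normal is estimated from central differences of the depth values,
    so the four neighbors of (x, y) must exist and be active.
    """
    dzdx = (z_buffer[y][x + 1] - z_buffer[y][x - 1]) / 2.0
    dzdy = (z_buffer[y + 1][x] - z_buffer[y - 1][x]) / 2.0
    return 1.0 / math.sqrt(dzdx * dzdx + dzdy * dzdy + 1.0)


shaded = depth_shade([[0, 128], [255, None]], z_max=255)
flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
ramp = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
flat_cos = gradient_shade(flat, 1, 1)
ramp_cos = gradient_shade(ramp, 1, 1)
```

The depth-only version needs one value per pixel, whereas the gradient version reads four neighbors, which is the non-local access pattern that complicates a hardware implementation.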


Summary and Conclusions 


The hardware architecture for a high speed 
image display system which would permit the 
real-time manipulation of 3-D objects obtained 
from real world data has been developed. A key 
feature of this system is support for interactive 
manipulation of the object in 3-D space including 
rotation, translation, and scaling. 
Straightforward algorithms containing no complex 
arithmetic or logical operations are utilized 
throughout. The hierarchical organization and 
high degree of modularity will facilitate an 
effective implementation with VLSI technology. 


Functional simulations of the display 
processor have been performed based on the 
existing DISPLAY software package. Current 
efforts are directed toward development of 
effective 3-D anti-aliasing techniques and 
investigation of more sophisticated shading 
algorithms which are appropriate for hardware 
implementation. 


We are planning to prototype real-time 
hardware for one 64-subcube of the overall system 
using Schottky TTL devices and MOS dynamic RAMs. 
The prototype will permit experience to be gained 
from actual use by medical researchers and 
clinicians. For some types of imaging techniques 
where the current attainable resolution is 
relatively low, this subset of the overall system 
may prove adequate. The prototype effort will be 
followed by the full system implementation with 
heavy reliance on VLSI technology. 


Abstract -- Speech understanding is a complex task 
which requires extensive computation. To increase the 
processing speed, a speech understanding system is 
decomposed into tasks which can be performed by a 
series of distributed processing sub-systems. An architec- 
ture to perform labeling, segmentation, and lexical pro- 
cessing is described. Using a parametric characterization 
of the speech signal, this system divides an utterance into 
labeled homogeneous regions. The system then performs 
dictionary lookups based on all probable labelings and 
segmentations in order to generate a complete set of word 
hypotheses. Using realistic assumptions from existing 
speech understanding systems, a statistical model of 
speech input, and simulations of the speech processing 
algorithms, the attributes of the parallel system to per- 
form labeling, segmentation, and lexical processing for 
real-time speech understanding are derived. 


I. Introduction 


A speech understanding system accepts speech input, 
derives a conceptual understanding of the input, and gen- 
erates an appropriate response. In a typical system, a 
number of knowledge source components interact to 
resolve the errors and ambiguity inherent in human 
speech. These knowledge sources perform operations 
such as acoustic parameterization, phonetic interpreta- 
tion, lexical processing, syntactic analysis, semantic 
interpretation, and response generation. Existing speech 
understanding systems that have been developed are 
described in [19, 21, 25]. 

The extensive computation required precludes real- 
time speech understanding on a conventional serial com- 
puter. To improve the processing speed, the different 
knowledge sources can act in parallel (possibly on 
different portions of an utterance), and in addition, com- 
putational tasks within each knowledge source can be 
performed in parallel. Advances in technology have made 
it realistic to consider large-scale parallel processing sys- 
tems [e.g., 1, 10, 24, 29]. By designing multiprocessor 
knowledge sources, real-time speech understanding (with 
a constant delay) should be achievable. The next section 
briefly outlines a general configuration for a multiproces- 
sor system for speech understanding. In sections III and 
IV, the speech understanding operations of labeling, seg- 
mentation, and lexical processing are outlined. Section V 
describes the organization of the system described in this 
paper. An example simulation of labeling, segmentation, 
and lexical processing is presented in section VI. Section 
VII describes the parallel algorithms and the parallel 
architectures that comprise the system. 


This material is based upon work supported by the National Science 
Foundation under Grant ECS-8120896. 


Fig. 1. The distributed speech understanding system. 


II. A Parallel Architecture For Speech Understanding 


An architecture proposed to handle the speech 
understanding task consists of a distributed series of 
compustations [2, 3]. Each compustation corresponds 
roughly to a speech understanding knowledge source. 
This distributed parallel architecture is diagrammed in 
Fig. 1. The interconnection of knowledge sources forms a 
linear pipeline. 

Each compustation may itself be a parallel system. 
A typical compustation consists of an input memory 
buffer (MB), an output MB, control units (CUs), and pro- 
cessing elements (PEs). The organization of the PEs 
within each compustation is selected to exploit whatever 
parallelism is inherent in the specific task being per- 
formed by that station. The processing time for each sta- 
tion is a function of the computational complexity of the 
tasks to be performed and the amount and arrival rate of 
input data. Assuming a maximum input rate, the pro- 
cessing speed requirements can be met by employing 
parallelism within the task algorithms and also among 
the tasks to be performed. Minimum processing time for 
the compustation will be insured when the data in the 
input MB is processed as soon as it is available. When a 
subset of PEs has finished a processing task and stored its 
result in the output MB, it is available to be assigned 
another task by the compustation’s control unit. 

Each compustation is specialized to meet perfor- 
mance (speed) requirements of the overall system. Pro- 
cessing proceeds asynchronously with respect to adjacent 
compustations. When the processing time for each sta- 
tion is approximately equal, then no bottlenecks occur 


and data flow through the system will be continuous, pro- 
viding real-time performance (with a constant delay). 
Because the parallelism within each compustation permits 
processing of all probable utterance hypotheses simul- 
taneously, there is no need to backtrack once any partic- 
ular hypothesis has been determined improbable. Thus, 
extensive parallelism is being used at each stage of the 
speech understanding process in order to simplify the 
interaction among the various knowledge sources. 


III. Labeling and Segmentation 


Speech labeling is the task of analyzing a speech 
utterance and assigning identifying labels to regions of 
the utterance. Speech segmentation is the task of locat- 
ing boundaries between homogeneous regions (segments) 
in a speech utterance. In some systems, labeling precedes 
segmentation; in others, segmentation precedes labeling. 
Together, the labeling and segmentation processes divide 
a speech utterance into labeled homogeneous regions. 
Speech labeling and segmentation are described in [27, 28, 
36]. Detailed descriptions of methods used in specific sys- 
tems are presented in [8, 13, 34]. 

The input to the labeling and segmentation process 
will be characteristic parameters computed for each frame 
of speech during acoustic processing. In general, each set 
of characteristic parameters represents an interval of 
speech ranging from 5 to 20 ms. The acoustic parameters 
describe each speech frame in terms of its characteristic 
time and frequency domain features [4, 36]. Labels can 
be assigned either to fixed duration frames (before seg- 
mentation) or to larger homogeneous segments (after seg- 
mentation). The labeling is typically done using pattern 
classification techniques. The criteria used and the 
number of different labels assignable to the regions must 
be consistent with the word representations used within 
the lexical database. The labeled segments produced will 
be used by the lexical processing component of the sys- 
tem to obtain word hypotheses. 

The labels used correspond to elements of a finite set 
of speech sounds which can be identified by definite 
acoustic characteristics. The most commonly used set of 
speech sounds are phonemes. A phoneme is a group 
of sounds that function similarly and are not meaning- 
fully distinctive among the group for a given language. A 
language will contain from 20 to 60 phonemes. Examples 
of different phonemes are the "oo" sound in "foot," the 
"oo" sound in "boot," and the "s" sound in "sit." It is 
often difficult to identify phonemes and their boundaries 
accurately from acoustic data [27]. Many intervals of 
speech will appear similar to more than one phoneme. 
These problems are handled in existing speech under- 
standing systems by permitting multiple classifications for 
a given speech interval and then applying a more exten- 
sive lexical analysis. 

Techniques to perform labeling and segmentation are 
either context independent or context dependent. A con- 
text independent scheme looks at the acoustic parameters 
for the speech frame, compares these parameters to previ- 
ously stored templates corresponding to the speech 
sounds label set, and assigns a label to the speech frame. 
A correctness score could be associated with the label. In 
some systems [34], the multiple labels for each frame are 
stored in a structure called a segment lattice. This 
preserves all probable frame labels and provides the 
option for the system to backtrack and consider less 
probable labels when the word hypothesis based upon one 
labeling becomes unlikely. 
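
A context independent labeler that keeps several candidate labels per frame might look like the Python sketch below. Everything specific here is an assumption for illustration: the two-element parameter vectors, the template values, the label names, and the use of Euclidean distance as the score are not taken from any of the cited systems.

```python
def label_frame(params, templates, keep=2):
    """Return the `keep` closest labels as (label, distance), best first."""
    scored = []
    for label, template in templates.items():
        dist = sum((p - t) ** 2 for p, t in zip(params, template)) ** 0.5
        scored.append((dist, label))
    scored.sort()
    return [(label, dist) for dist, label in scored[:keep]]


# Hypothetical templates: one stored parameter vector per speech sound.
templates = {"s": (0.9, 0.1), "uw": (0.2, 0.8), "uh": (0.3, 0.7)}
frames = [(0.85, 0.15), (0.22, 0.78)]

# One list of scored candidate labels per frame: a simple segment lattice.
lattice = [label_frame(frame, templates) for frame in frames]
best_labels = [[label for label, _ in candidates] for candidates in lattice]
```

Because each frame is scored against the templates independently of its neighbors, the frames can be distributed across processing elements with no communication between them.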

A context dependent labeling and segmentation 
scheme uses information from the current and surround- 
ing frames or segments when labeling a given speech 
interval. Some systems use a combination of both con- 
text dependent and context independent methods. 


IV. Lexical Processing 


Lexical processing is the task of combining various 
lengths of contiguous sequences of segmented and labeled 
speech data, comparing these intervals to a word diction- 
ary or lexicon, and generating probable word hypotheses. 
Noise, variable pronunciations, mispronunciations, 
regional dialects, and variable speaking rates make it 
difficult to map acoustic events directly to word 
hypotheses. Lexical processing must take into account 
possibilities such as mislabeled input segments and varia- 
tions in pronunciation when attempting to match an 
input utterance to entries in the word dictionary. Lexical 
processing is described in [18, 32]. Detailed descriptions 
of methods used in specific systems are presented in [11, 
31, 34]. 

A number of lexical processing approaches exist. One 
method for determining possible words within continuous 
speech, called ‘‘bottom-up” word hypothesizing, uses 
acoustic information from the input utterance to propose 
words. In one possible implementation, the system 
includes a data structure constructed from acoustic 
descriptions of all words comprising the system’s vocabu- 
lary. The acoustic information obtained from the input 
utterance will control the search through the data struc- 
ture to hypothesize a word. 
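
A toy version of bottom-up word hypothesizing is sketched below in Python. It assumes the lexicon is a plain mapping from words to phoneme sequences and that each segment carries a set of candidate labels; likelihood scoring, mislabeled segments, and pronunciation variants are deliberately left out, and the lexicon and labels are invented for the example.

```python
def hypothesize_words(lattice, lexicon):
    """Return (word, start, end) for every lexicon entry that matches a
    contiguous run of segments. A segment matches a phoneme if the phoneme
    is among that segment's candidate labels."""
    hypotheses = []
    for word, phonemes in lexicon.items():
        n = len(phonemes)
        for start in range(len(lattice) - n + 1):
            if all(p in lattice[start + i] for i, p in enumerate(phonemes)):
                hypotheses.append((word, start, start + n))
    return sorted(hypotheses, key=lambda h: (h[1], h[2], h[0]))


# Hypothetical three-segment lattice with two candidate labels per segment.
lattice = [{"s", "z"}, {"ih", "eh"}, {"t", "d"}]
lexicon = {
    "sit": ["s", "ih", "t"],
    "set": ["s", "eh", "t"],
    "it": ["ih", "t"],
    "boot": ["b", "uw", "t"],
}
words = hypothesize_words(lattice, lexicon)
```

Every word and every starting position is tested independently, so the lookups form a natural unit of parallel work.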

The output of lexical processing will be word 
hypotheses accompanied by likelihood scores. This infor- 
mation is passed to other knowledge sources such as syn- 
tax and/or semantics for further processing. 


V. The Distributed/Parallel Organization


As described in sections III and IV, there are many
different methods that can be used to perform the opera- 
tions of labeling, segmentation, and lexical processing, 
and the boundaries between these operations are not 
necessarily clear. It is therefore appropriate to consider a 
single architecture to perform all of the operations 
involved in labeling, segmentation, and lexical processing. 

The merging of compustations is depicted in Fig. 1,
and the organization of the architecture is shown in Fig.
2. The portion of the figure marked by (A) indicates the
PEs performing parallel labeling and segmentation.
These PEs accept sets of time and frequency domain
characteristic acoustic parameters which represent a 12.8
ms frame of speech. These parameters are stored within
the input MB by the Acoustic Processing Compustation.
(A design of an Acoustic Processing Compustation was
presented in [4].) The labeling and segmentation opera-
tions produce a segment lattice. The segment lattice will
contain multiple phoneme labels for each frame of speech.
This data set is indicated by (B) in Fig. 2 and would
reside in a MB. The second set of PEs, indicated by (C),
will use the segment lattice data to perform parallel word
hypothesization. The multiple word hypotheses, indi-
cated by (D), are stored in the output MB and will be
used by the Syntactic Processing Compustation to per-
form syntactic analysis.


VI. An Example Simulation


To determine the requirements of a parallel architec- 
ture to perform real-time labeling, segmentation, and lex- 
ical processing, a specific set of algorithms was chosen 
and the operations of the algorithms were simulated. In 
this section, the algorithms and operation are described 
and the approach used in the simulation is outlined. 


Algorithms and Operation 


For each 12.8 ms frame of input speech, the 
compustation obtains a set of characteristic acoustic 
parameters from the input MB. The compustation 
determines the similarity of these input parameters to 
stored phonetic templates and produces a set of probable 
phonetic labels for the frame based upon this similarity. 
In the algorithm considered here, the probable labels are 
determined by classifying the speech spectra using a
minimum distance measure with linear mean correction.
This technique computes a distance measure
between 40 power spectrum values representing the input
speech and a set of 40 template values stored for each of 
33 possible phoneme labels. The distance measure results 
are then sorted to determine the most probable labels for 
the frame. Each distance measure is compared to a 
threshold and from one to five most probable labels are 
assigned to the frame. By keeping up to the five most
probable labels for the frame, the correct label will be
included 95% of the time [9]. This step is a form of
context independent labeling on fixed duration frames.

Once labels have been assigned to a frame, the previ- 
ous and following frames are examined for occurrence of 
each phoneme label. If a phoneme label does not occur in 
either of the neighboring frames, the label is eliminated 
from the set of labels for the frame. If all of the labels 
for a frame are non-existent in neighboring frames, then a 
new symbol indicating an unknown phoneme label is as- 
signed to the frame. This context-dependent label remo- 
val introduces an additional one frame (12.8 ms) delay to 
the speech processing. 
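
The two labeling steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a plain Euclidean distance (the linear mean correction is omitted), and the function names, the threshold value, and the "?" symbol for an unknown phoneme are all hypothetical.

```python
import math

def label_frame(spectrum, templates, threshold, max_labels=5):
    """Context-independent step: return one to max_labels labels, nearest first."""
    # Distance from the input spectrum to every stored phoneme template.
    scored = sorted((math.dist(spectrum, tpl), name) for name, tpl in templates.items())
    kept = [name for d, name in scored[:max_labels] if d <= threshold]
    # Assumption: the single best label is always kept, so a frame has at least one.
    return kept or [scored[0][1]]

def remove_isolated(prev_labels, labels, next_labels, unknown="?"):
    """Context-dependent step: drop labels found in neither neighboring frame."""
    neighbors = set(prev_labels) | set(next_labels)
    kept = [lab for lab in labels if lab in neighbors]
    # If nothing survives, the frame receives the unknown-phoneme symbol.
    return kept or [unknown]
```

Because `remove_isolated` needs the labels of the following frame, it can only run one frame later, which is the source of the extra 12.8 ms delay mentioned in the text.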

Considering an interval of speech containing a 
number of segments, the multiply labeled frames over 
that interval form a phoneme segment lattice (e.g., (B) in
Fig. 2). The paths through the lattice define the possible 
words within the speech input. The lattice is searched to 
generate all possible strings of phonemes. Bottom-up lex- 
ical processing is then performed, in which these strings 
are compared to a lexical database of phoneme strings 
which represent the words known to the system. From 
this comparison, word hypotheses are determined. 

The lexical database assumed represents a word vo- 
cabulary of 6,843 common English words. Each word is 
stored by its phonetic representation, with the longest 
phonetic representation consisting of seven phonemes. 
These words were taken from a vocabulary of 10,065 En- 
glish words with phoneme string representations of up to 
fifteen phonemes [16, 26]. In the simulation, the strings 
generated from the speech input range in length from 
three to seven phoneme segments. (One and two
phoneme strings are handled as special cases.) When the 
frequency of use of the words within the database is tak-
en into account [16, 26], the 6,843-word vocabulary (i.e.,
the words of seven or fewer phonemes) comprises 95% of
the words actually used. Multiple representations of a
single word are included in the lexicon. It is assumed 
that these are generated by applying contextual and pho- 
nological rules to the theoretical word representation. 
The rules will account for pronunciation deletions, inser-
tions, coarticulation, and multiple mispronunciations [7,
35]. This lexical expansion will result in an approximate-
ly three-fold increase [35] in the lexical database size.
The resulting database contains 20,529 entries. 

Because all paths through the phoneme lattice are 
compared to the database, multiple conflicting word hy- 
potheses may be generated. In addition, every point in 
the lattice at which there is a transition from one 
phoneme to another represents a potential end of one 
word and beginning of the next. Word hypotheses based 
on all such word boundaries will be generated. A special 
case of this may give rise to multiple possible start and 
stop times for a single word. The multiple word 
hypotheses and the range of start and stop times will be 
stored in the output MB and resolved by a later 
compustation. 


Simulation of the Algorithms 


The operations of the algorithms to perform labeling, 
segmentation, and lexical processing were simulated on 
dual-processor VAX 11/780 computers [14] which are part
of the Engineering Computer Network of Purdue Univer- 
sity [17]. The components of the simulation are di- 
agrammed in Fig. 3, and are discussed in the following 
paragraphs. The bold arrows indicate the flow of the 
simulation. 

The input to the simulation was a stream of 
phonemically labeled frames which model speech input. 
The statistical distribution of the phonemes in the input
is based upon a statistical linguistic analysis [26] of a vo- 
cabulary of 10,065 English words [16]. This information, 
combined with the average duration of each phoneme [15, 
33], was used to model the input stream as a stochastic 
process [6]. A 33-state Markov chain was used to gen- 
erate the phoneme stream. The phoneme stream statisti- 
cally models the percent occurrence of each phoneme or 
silence, the duration of each phoneme or silence, and the 
transition probability of two adjacent phonemes. A sta-
tistical analysis was performed upon the generated
phoneme stream to verify the accuracy of the speech 
model. The statistically generated phoneme input was 
used to avoid the difficulty of performing computationally 
intensive acoustic parameterization on the enormous 
amount of speech input data which would be required to 
obtain representative phoneme distributions and patterns. 
Details of the phoneme stream generation can be found in
[?].

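
A Markov-chain phoneme generator of the kind described above might look like the following sketch. It is only illustrative: the paper uses a 33-state chain with measured statistics, whereas the transition table and durations here are placeholders supplied by the caller, durations are assumed fixed per phoneme, and the function name is hypothetical.

```python
import random

def generate_phoneme_frames(transitions, durations, start, n_frames, seed=0):
    """Emit one phoneme label per frame from a first-order Markov chain.

    transitions[p] maps each successor of phoneme p to its probability;
    durations[p] is the (assumed fixed) number of frames that p lasts.
    """
    rng = random.Random(seed)
    frames, current = [], start
    while len(frames) < n_frames:
        # Hold the current phoneme for its duration, one label per frame.
        frames.extend([current] * durations[current])
        successors = list(transitions[current])
        weights = [transitions[current][s] for s in successors]
        current = rng.choices(successors, weights=weights)[0]
    return frames[:n_frames]
```

Checking the generated stream against the intended occurrence, duration, and transition statistics corresponds to the verification step mentioned in the text.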
The generated phoneme stream contains one 
phoneme label per input frame of speech. The results of 
the labeling algorithms are simulated by generating mul- 
tiple phoneme labels for each frame, based on labeling 
performance data which indicates the probability of 
confusing the true input phoneme with acoustically simi-
lar phonemes.

The resulting multiple phoneme label input stream 
corresponds to the segment lattice data and is used as an 
input to simulate the complexity of the labeling and seg- 
mentation algorithms. The segment lattice also provides 
the input to the lexical processing simulation. In particu- 
lar, the “Phoneme String Generation” process provides 
statistics on the number of strings of various lengths that 
will be compared to the lexicon.

Complexity data from the labeling and segmentation 
simulation, results of the phoneme string generation, and 
the lexical database statistics are used to simulate the 
operations of lexical processing. 


VII. The Parallel Architectures


In this section, the parallel sub-systems to perform 
labeling, segmentation, and lexical processing are
described. The PE model used in the architecture design 
is presented in [20] and is based upon the Motorola 
MC68000 microprocessor [23]. Processor timing calcula- 
tions were made with the MC68000 running Motorola’s 
68343 fast floating point software [22]. The execution 
times were calculated for the MC68000 running with a 
12.5 MHz clock frequency. A total of 13,100 frames of speech
data were simulated to obtain the operation counts and data-
base sizes.

For the architecture design presented here, it was as- 
sumed that each sub-system must attain real-time perfor- 
mance based on the average arrival rate of data to its in- 
put MB. This implies that, on the average, real-time per- 
formance will be achieved as long as excess data can be 
buffered and processed later. A more stringent analysis 
would require the processing to meet maximum data 
rates. Because of the very large variance in the data 
rates from one frame to another, this approach is not 
practical and is not considered here. 


Labeling 


The labeling sub-system must compute the distance 
from the set of input acoustic parameters to the tem- 
plates for each of the 33 possible phonemes, select the (up 
to) five most probable phonemes, and perform the con- 
text dependent removal of isolated (single frame) labels. 
The context independent labeling must be performed 
within 12.8 ms so that this labeling is completed by the 
time the next set of input parameters is available. Once
the subsequent frame is labeled, the context dependent
label removal must also be completed within 12.8 ms. Based
on the operations required, the architecture to perform la- 
beling is an eight PE MIMD system with one CU. The 
comparison of the input parameters to the 33 templates is 
distributed across the PEs, with each PE computing the 
distance to four templates. The CU computes the 33rd 
distance and performs the final phoneme selection and 
context dependent labeling. Communication is between 
the CU and the PEs, first to disseminate the acoustic 
data then to collect the computed distances. No inter-PE 
communication is required. 
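
The split of the 33 template comparisons over eight PEs and the CU can be checked with a short sketch. Which template goes to which PE is not stated in the paper; the contiguous assignment below and the function name are assumptions.

```python
def assign_templates(n_templates=33, n_pes=8):
    """Split template indices evenly over the PEs; the CU takes the remainder."""
    per_pe = n_templates // n_pes
    pes = [list(range(p * per_pe, (p + 1) * per_pe)) for p in range(n_pes)]
    cu = list(range(n_pes * per_pe, n_templates))
    return pes, cu
```

With the default values each PE receives four templates and the CU is left with exactly one, which matches the "33rd distance" computed by the CU.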


Segmentation 


Once a frame is labeled, the segmentation sub- 
system must determine what new strings exist in the 
phoneme segment lattice as a result of the possibly multi-
ple labels for that frame. This involves appending each
new label to every length six or shorter string in the lat- 
tice and determining what new strings result. The 
phoneme segment lattice is updated (i.e., the new strings 
are added to the lattice) and the strings which are 
identified as possibly complete are passed to the lexical 
sub-system as potential word hypotheses. Processing 
should be completed by the time the next set (frame) of 
labels is available. In general, the architecture to perform 
segmentation consists of an input CU and seven sub- 
systems, each an MIMD system consisting of a CU and a 
set of PEs. The input CU monitors the arrival of new in- 
put label sets and broadcasts the input MB contents to 
the sub-system CUs. Sub-system ℓ holds the lattice en-
tries of length ℓ, 1 ≤ ℓ ≤ 7. Using the labels for the new
frame, sub-systems 1 through 6 form length ℓ+1 strings
and pass these to sub-system ℓ+1 for possible addition to
the length ℓ+1 lattice database. Similarly, sub-systems 2
through 7 receive length ℓ strings formed in sub-system
ℓ−1 for possible addition to the length ℓ lattice data-
base.
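
The per-frame lattice update can be sketched sequentially as below. This is a simplified model under stated assumptions: strings are tuples of phoneme labels, timing information is ignored, a label is not appended to a string that already ends in that label (treated here as the same segment continuing), and `extend_lattice` is a hypothetical name.

```python
def extend_lattice(lattice, new_labels, max_len=7):
    """Update the lattice with one frame's labels; return the new strings by length.

    lattice maps a length l to the set of stored strings of that length.
    """
    # Length-1 candidates are the new labels themselves.
    candidates = {1: {(lab,) for lab in new_labels}}
    # Sub-systems 1..6: append each new label to every stored length-l string.
    for l in range(1, max_len):
        candidates[l + 1] = {
            s + (lab,)
            for s in lattice.get(l, set())
            for lab in new_labels
            if s[-1] != lab
        }
    # Each sub-system keeps only the candidates it does not already hold.
    added = {}
    for l, strings in candidates.items():
        added[l] = strings - lattice.get(l, set())
        lattice.setdefault(l, set()).update(added[l])
    return added
```

All candidates are formed from the lattice as it stood before the frame arrived, so a string created in this frame is not extended again until the next frame.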

If sub-system ℓ has P_ℓ PEs, then the length ℓ strings
will be distributed approximately evenly across
the P_ℓ PEs. Each PE forms new strings using the new
labels and its portion of the database. Similarly, each PE 
searches its portion of the database when determining if a 
string already exists in the database. Although the PEs 
perform essentially the same algorithm, MIMD operation 
is used to allow the string comparisons to abort as soon 
as possible. No communication is required among the 
PEs. The CU must be able to broadcast data and control 
signals to the PEs and must be able to poll the PEs for 
the status of their operation. Communication from sub-
system ℓ to sub-system ℓ+1 is performed through an in-
termediate MB between the CUs. Finally, CU ℓ will pass
length ℓ strings to the lexical processing sub-system for
dictionary lookup. This communication also proceeds 
through an intermediate MB. 

Using the simulation results, the processing for the 
length one, two, three, and four strings can be handled in 
real-time by a single processor. In order to maintain the 
average processing rate in real-time, sub-systems 5, 6, and 
7 are MIMD systems with P_5 = 2, P_6 = 5, and P_7 = 21.
Each of these sub-systems has its own CU. The architec- 
ture for segmentation therefore consists of four sub- 
systems. The maximum amount of PE memory required 
for storage of the phoneme segment lattice ranges from 
5.6K 16-bit words for the 1/2/3/4 sub-system to 117K 
words for the length 7 sub-system. The total memory 
needed for storage of the lattice is 169K words. 


Lexical Processing 


The lexical processing sub-system compares strings 
from the phoneme segment lattice to the stored lexicon 
and outputs word hypotheses. For real-time operation, 
the rate at which the sub-system can search the lexicon 
must equal the rate at which strings arrive from the seg- 
mentation sub-system. Based on the simulations, the 
average rate will be 177 strings per 12.8 ms frame. The 
general architecture for lexical processing consists of an 
output CU and seven sub-systems, each an MIMD system 
consisting of a CU and a set of PEs. Sub-system ℓ holds
the length ℓ portion of the lexicon, 1 ≤ ℓ ≤ 7. Within
sub-system ℓ, the lexicon entries are ordered and are dis-
tributed across the P_ℓ PEs, so that PE p holds the items
whose position in the ordering is congruent to p mod P_ℓ,
0 ≤ p < P_ℓ. A MB resides between segmen-
tation CU ℓ and lexical processing CU ℓ and receives the
search request strings. Lexical CU ℓ broadcasts each
string to its PEs, each of which performs a binary search. 
The CU must be able to broadcast data and control sig- 
nals to the PEs and must be able to poll the PEs for the 
status of their operation. No inter-PE communication is 
needed. Located word hypotheses are passed from the 
sub-system CU via an intermediate MB to the output 
CU, which writes the hypothesized word and its associat- 
ed information in the compustation output MB. 
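
The interleaved lexicon layout and the broadcast lookup can be sketched as follows. The sketch runs the PEs one after another rather than in parallel, represents phoneme strings as tuples, and uses hypothetical function names; it assumes that entry i of the ordered lexicon is placed on PE i mod P.

```python
from bisect import bisect_left

def distribute(lexicon, n_pes):
    """Interleave the ordered lexicon: entry i goes to PE i mod n_pes."""
    ordered = sorted(lexicon)
    return [ordered[p::n_pes] for p in range(n_pes)]

def pe_search(portion, string):
    """Binary search within one PE's portion, which is itself still sorted."""
    i = bisect_left(portion, string)
    return i < len(portion) and portion[i] == string

def lookup(portions, string):
    """CU broadcast: the string is a word hypothesis if any PE finds it."""
    return any(pe_search(portion, string) for portion in portions)
```

Taking every P-th entry of a sorted list leaves each portion sorted, so every PE can run an ordinary binary search on its own share.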

Based on the simulations and the assumptions about 
the lexicon size, and requiring lexical processing to match 
the average arrival rate of strings from the segmentation 
sub-system, a single PE is capable of performing the dic- 
tionary searches in real-time. Four factors contribute to 
this result. (1) Relatively few strings need to be looked 
up (on the average, 177 per frame); (2) the binary search 
is very efficient; (3) within the binary search, it is rare 
that all of the characters in the two strings need to be 
compared; (4) the time for a symbol comparison on the
MC68000 is quite short. Therefore, for the specific as- 
sumptions made here, the lexical sub-system can consist 
of one PE which scans the intermediate MBs holding the 
segmentation output, performs the lookups, and writes 
the located strings into the output MB. If a different lex- 
ical database were assumed, allowing, for example, longer 
strings, larger vocabulary, or more variability in pronun- 
ciation, the multiprocessor architecture would be applica- 
ble. Using the assumptions outlined here, the memory re- 
quired for storage of the lexical database is 139K 16-bit 
words. 


VIII. Summary


A parallel architecture to perform the speech under- 
standing operations of labeling, segmentation, and lexical 
processing has been described. The overall architecture 
consists of a total of 43 (42 + 1) processors organized 
into three major components, corresponding to the three 
principal operations performed. Within the component 
systems are MIMD or uniprocessor sub-systems. The ar- 
chitecture characteristics were derived using realistic as- 
sumptions about both speech data and speech processing. 
Extensive simulation was performed to determine the re- 
quired processing and data rates. 
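
One tally that reproduces the stated total of 43 (42 + 1) processors is sketched below. The breakdown is an interpretation of Section VII rather than a table given in the paper: it assumes that CUs count as processors, and that strings of length one to four share a single processor with no separate CU.

```python
labeling = 8 + 1                      # eight PEs and one CU
segmentation = (
    1                                 # input CU
    + 1                               # one processor for strings of length 1 to 4
    + sum(p + 1 for p in (2, 5, 21))  # P_5, P_6, P_7 PEs, each sub-system with its own CU
)
lexical = 1                           # a single PE performs all dictionary searches
total = labeling + segmentation + lexical
```

Under these assumptions labeling and segmentation together account for the 42 processors, and lexical processing for the remaining one.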

The algorithms chosen represent one set of the many 
different techniques used in labeling, segmentation, and 
lexical processing. Clearly the architectural details will 
reflect the algorithms chosen. This specific study demon- 
strates the viability of the use of parallelism in speech 
understanding systems and explores techniques for deter- 
mining system characteristics from specific applications 
requirements. Future work includes analysis of addition- 
al speech processing methods and knowledge sources. 


ABSTRACT The absence of well
suited tools for defining semantics of
parallelism in programming languages is
one motivation for this study.

We suggest using, in this framework,
the algebraic abstract data type tech-
niques, due to the advantages they of-
fer and which have been justified in
several papers.

A simple and deterministic language
with certain features of parallelism
is presented, and an abstract data type
associated with it is defined to make
its semantics precise.


KEYWORDS Concurrency, communi-
cation, algebraic abstract data type,
semantics, serial time.


1. INTRODUCTION
Abstract data types are suggested by
several authors
to be a flexible, uniform, powerful
and abstract tool for formally specify-
ing data structures processed by pro-
gramming languages.
Applying the same technique, especially
the algebraic methods, to sequential
applicative and procedural languages
has already been done to define the
semantics of program variables and as-
signments [9], procedures [4], or
recursive functions.
In this approach, a programming langua-
ge can be described by an abstract
(data) type, as described for example
in [1], for which the meaning is gene-
rally explained by particular models
such as initial or terminal ones, or
the set of all possible models (which
may be described by sets of congruences
of the term algebra). Thus it would
be possible to have several semantic
models for programs according to the
choice of the model associated with the
abstract type.
The object of the present paper is to
apply and extend the techniques of al-
gebraic semantics (for further explana-
tion see [8]) to describe paralle-
lism, using a simple and deterministic
language, named CSPD, embodying explicit
forms of concurrency and communication.


2. SYNTAX AND INFORMAL MEANING

The main features of this language
are

- Processes are disjoint; they do not have
any shared variables. They only in-
teract by means of communication.

- Communication is achieved by means of
send and receive operations which are
primitives of the language. Besides
passing messages, communication may
play a role of synchronisation.
RECEIVE x FROM P_i (in P_j) is an input
command, expressing an input request
of the process P_j from the process P_i
and assignment of the input value to
the (local) variable x.

SEND y TO P_j (in P_i) is an output
command, meaning a request of P_i to
output the value of y to P_j.

- Processes are deterministic and con-
currency between them is made explicit by
a parallel operator.


The other components of the language
are the nop and abort constructions,
the sequential composition (stmt1; stmt2),
the assignment (variable := expression),
the conditional (IF boolexp THEN stmt1
ELSE stmt2 FI), the while
loop (WHILE boolexp DO stmt OD) and typed
declarations (variable : type).

An example of a CSPD program is the
following

PROGRAM Toy; ...

PROCESS Div;
VAR x, y, quot, rem : INTEGER;
BEGIN
  RECEIVE x FROM User; RECEIVE y FROM User;
  quot := 0; rem := x;
  WHILE rem >= y DO
    rem := rem-y; quot := quot+1;
  OD;
  SEND quot TO User; SEND rem TO User;
END;
PROCESS User;
VAR a, b, q, r : INTEGER;
BEGIN
  IF (a > 0 AND b > 0)
  THEN SEND a TO Div; SEND b TO Div;
       RECEIVE q FROM Div;
       RECEIVE r FROM Div
  ELSE NOP
  FI
END;
PARBEGIN
  ... Div//User ...
PAREND.
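
For readers who want to run the Toy example, the two processes can be imitated with threads and message queues, as in the sketch below. This is only an analogy and makes several assumptions: Python's buffered `queue.Queue` stands in for CSPD communication (whose exact type the paper leaves open), the values of a and b are passed in as arguments, the loop test is written `rem >= y` so that exact divisions leave a zero remainder, and `run_toy` is a hypothetical name.

```python
import queue
import threading

def run_toy(a, b):
    """Run Div and User as two threads; return (q, r) as seen by User, or None."""
    to_div, to_user = queue.Queue(), queue.Queue()
    result = []

    def div():
        x = to_div.get()                 # RECEIVE x FROM User
        y = to_div.get()                 # RECEIVE y FROM User
        quot, rem = 0, x
        while rem >= y:
            rem -= y
            quot += 1
        to_user.put(quot)                # SEND quot TO User
        to_user.put(rem)                 # SEND rem TO User

    def user():
        if a > 0 and b > 0:
            to_div.put(a)                # SEND a TO Div
            to_div.put(b)                # SEND b TO Div
            q = to_user.get()            # RECEIVE q FROM Div
            r = to_user.get()            # RECEIVE r FROM Div
            result.append((q, r))
        # ELSE NOP: User sends nothing, so Div stays blocked on its first receive.

    div_thread = threading.Thread(target=div, daemon=True)
    user_thread = threading.Thread(target=user)
    div_thread.start()
    user_thread.start()
    user_thread.join()
    return result[0] if result else None
```

When a or b is not positive, User takes the NOP branch and Div waits forever for input; the Div thread is marked as a daemon so that this blocked process does not keep the program alive.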


The language, in summary, corres-
ponds to the following syntax, where
id, procid, declarations, var, exp and
boolexp are nonterminals for which we
give no syntax rule and which represent
respectively the program identifier, a
process identifier, the local declara-
tions, a variable, an expression and a
boolean expression.


program —> PROGRAM id; {process};
           PARBEGIN procid {//procid} PAREND.
process —> PROCESS procid; declarations;
           BEGIN stmt END
stmt    —> NOP | ABORT | var := exp |
           IF boolexp THEN stmt1 ELSE stmt2 FI |
           WHILE boolexp DO stmt OD |
           stmt1; stmt2 | SEND exp TO procid |
           RECEIVE var FROM procid


3. FORMALIZATION OF THE SEMANTICS 
OF CSPD. 


3.1. Hypothesis
We suppose
- defined in a formal manner the con-
text conditions of CSPD. From now on,
we deal only with legal programs de-
signed in CSPD (no static semantics
are given).
- defined the basic (primitive) types
that occur in CSPD.
We have in particular
VALUE : the type of value sorts, in-
cluding the sort of identi-
fiers(*) and the data sort Data,
for which equality operations are
predefined.

The type of expressions with
a general sort Exp (including
in particular the sort of boo-
lean expressions Bexp) for
which a substitute operation
is given.
According to [8], one method for defining
the semantics of a programming language con-
sists of extending the semantics, sup-
posed to be known, of the primitive ob-
jects to the other objects, that is, the
constructs of the language. The hierar-
chical construction of the type is gi-
ven below.
(*) The sort Ident comprises variable identifiers
Varid and process identifiers Procid.

3.2. CSPD abstract type : the systematic hierarchical description

To make explicit the definition of
CSPD other types are required, in par-
ticular those concerning the language
phrases and (the interpreter) states.
Thus we have

STATE The type of states with a sort
State generated by
- a nullary operation
init : —> State
- a binary operation connec-
ting states and the language
phrases :
apply : Modif x State —> State
To evaluate expressions into a value,
we introduce an operation expressing
this requirement
eval : Exp x State —> Data


MODIF The type corresponding to phra-
ses (statements and declara-
tions) occurring in the langua-
ge, with a general sort Modif
generated by the following ope-
rations(*), each shown with its
notation convention beside the
operation profile.

nop, abort : —> Modif               notation: nop, abort
to make explicit the meaning
of empty and abortion statements,

seq : Modif x Modif —> Modif        notation: s1 ; s2
to make explicit the definition of
the sequential composition,

if : Bexp x Modif x Modif —> Modif  notation: if b then s1 else s2 fi
to define explicitly the con-
ditional statement,

while : Bexp x Modif —> Modif       notation: while b do s od
to define explicitly the
while loop statement,

send : Procid x Procid x Exp —> Modif
to make explicit the meaning of a
send statement between a re-
ceiver and a sender, two con-
current processes,

receive : Procid x Procid x Varid —> Modif
to make explicit the meaning of a
receive statement between a
sender and a receiver, two
concurrent processes.

(*) We concentrate our attention only on state-
ments, ignoring object declarations.


AGENT The type corresponding
to a process defini-
tion, with a sort Agent
generated by
declagent : Procid x Modif
—> Agent

PROGR The type corresponding
to a CSPD program des-
cription, with a sort
Progr generated by put-
ting together processes
in a concurrent way
parall : Agent x Agent
—> Progr


3.3. Remarks 
(1) Having defined the abstract data type
for CSPD, the semantic descriptions
remain to be made explicit. They will
be certain terms of this abstract ty-
pe. For instance, the term correspon-
ding to an expression is of the sort
Exp and the term corresponding to a
statement is of the sort Modif. The
mapping of concrete programs into
terms is specified using semantic
(conditional) equations.


(2) To be more explicit, the hierarchi-
cal construction of the type can be
schematized as follows.

[Diagram of the type hierarchy; only the labels "Modifications" and "Sort" are recoverable.]


(3) For the CSPD type the following
basic equalities are required

- s ; nop = s
- abort ; s = abort
- (s1 ; s2) ; s3 = s1 ; (s2 ; s3)
- while b do s od = if b then s ;
  while b do s od else nop fi
- a1 || a2 = a2 || a1, where a_i is
  the agent declared for process P_i
  with modification s_i (i = 1, 2)

With the help of these equalities eve-
ry CSPD-term of sort Modif can be
brought into the normal form s1 ; s
where s1 is either a nop, assign,
receive or if operation.

Every term of sort Progr is equivalent
to a term of an analogous normal form
(the displayed formula is illegible in the source),
in which the leading component of each
process corresponds to an ele-
mentary or conditional instruction. In
our study we consider only the systems
which are designed by two CSPD proces-
ses; a generalisation to n processes
remains to be treated.


(4) Following the nature of modifica-
tions we consider them in a stratified
manner. We note at least two kinds of
modifications

- Elementary modifications, including :
The basic modifications which cor-
respond to the nop, abort and assign-
ment constructions. We characterize
them by the predicate
bstm? : Modif —> Bool

The input/output modifications which
correspond to the send and receive com-
mands. We characterize them by the pre-
dicate
iostm? : Modif —> Bool

- Composed modifications which corres-
pond to the sequential composition,
conditional and while loop construc-
tions. We characterize them by the pre-
dicate
cstm? : Modif —> Bool

From now on, we keep this decompo-
sition, because it
will help us to clarify some difficult
notions in parallel programming theory
and to master the complexity arising
from a large number of functions
and axioms.

3.4. CSPD abstract type : a formal specification
3.4.1. A specification framework
3.4.1.1. The problem position


To specify (i.e., give semantics for)
the full type two approaches are possi-
ble

(1) Consider that couples generated
by the function
config : Modif x State —> Conf
can be reduced by a semantic function
reduce : Conf —> Conf
This is an approach of operational se-
mantics of systems, in the form of
transition systems. For more explana-
tions see [2], [11].

(2) Consider that a statement trans-
forms the state by the "apply" opera-
tion, whose properties we specify
by using conditional axioms and substi-
tutions on the formulas.

To deal with this approach,

- At first, if we are not interested in the
communication features, a single process (ta-
ken out of a set of communicating processes)
is regarded by itself to be a semantically
meaningful entity [3].

- Then, a binding operator is introduced
to combine the separate "a priori"
meanings of all the processes into a joint
meaning of the system formed by their
parallel composition, and to settle communi-
cation features. This definition uses the im-
portant notion of (execution) time introduced
below.


3.4.1.2. States and temporal specification 


A state of a given system designed in CSPD can
be viewed as a tuple of process states (the
processes being those composing the system). Re-
call that the process states are disjoint
because of the lack of global variables in the
language and that each one can be viewed as a
(linear) computation from an initial state at an
initial date (say t_0). If we are interested in
describing, at a given date t, the general
state of such a system (say Σ at t), two pos-
sibilities can take place

- Σ is perfect : the processes finish at the
same time t the execution of their current
instruction.
- Σ is imperfect : at least one process fini-
shes at the time t the execution of its cur-
rent instruction, while the other does not.
To consider such situations, we introduce
the predicate :

perfect? : State —> Bool.
To any particular process, one wishes to attach
a date (involving on the other hand all exe-
cution sequences which a program generates on
the general state) and a modification. From
now on, a state is considered as a tuple :
<memory, date, modification>
The perfectness of a process state is thus gi-
ven by :

perfect? (<σ,t,s>) = if s = nop then true
else false fi

We suppose that the initial states (at initial
date t_0) are perfect.
To cope with the time notions, we suppose as
known the primitive type DATE with the sort
Date, ranged over by t, and give below the profi-
les and informal meanings of the needed ope-
rations :

(1) dateof : State —> Date
defines the second projection of a state seen
as a tuple of the form <σ,t,m> where σ is a
process memory of the sort Memory and t is the
associated date of the sort Date.

(2) datend : State —> Date
defines the date at which the current modifi-
cation, submitted to a state, ends. The
state at that date is complete. The computa-
tion of that date depends in particular on the
duration of the (current) statement execution.
Note that this operation is not applicable
to the input/output modifications.

(3) precedes : Date x Date —> Bool
defines a total order on Date.

Remark :
The "dateof" and "datend" operations can yield,
for a given state, equal dates. The rela-
tive equations can be written as follows :

dateof (<σ,t,s>) = t

datend (<σ,t,s>) = if perfect? (<σ,t,s>)
then t
else dateof (apply (s,
<σ,t,nop>)) fi



3.4.2. The "a priori semantics" equations

In this section we focus
on defining the equations which correspond to
the "apply" operation, connecting a process
state and its modifications, which has the
profile : Modif x State —> State.

Notice that nothing will be said about the com-
munication operations, treated in the
following section.

The computation of a sequential process P
starts at an initial perfect state, say
Σ_0 = <σ_0(P), t_0, nop> where t_0 is the initial date,
and ends at one of the following :

- "failure_{t_n}", which records an erroneous com-
putation ;
- "abortion_{t_n}", which records an abortion ;
- <σ_n(P), t_n, nop> otherwise (t_n being
the date of the computation end).


(1) Suppose that the perfect state
Σ_t = <σ_t(P), t, nop> of the process P is
known at time t.

The following are the equations, induced on
the syntax, of the apply operation.

apply (nop, Σ_t) = Σ_t
apply (abort, Σ_t) = abortion_t
apply (v:=e, Σ_t) = <σ', t', nop>
    such that σ' = subst (val (eval (v, Σ_t)), eval (e, Σ_t))
    and precedes (t, t')
apply (s1 ; s2, Σ_t) = apply (s2, apply (s1, Σ_t))
apply (if b then s1 else s2 fi, Σ_t) =
    if eval (b, Σ_t) = true then apply (s1, Σ_t)
    elsif eval (b, Σ_t) = false then apply (s2, Σ_t)
    else failure_t fi
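The apply equations for a perfect state can be read as a small interpreter. The following is only a minimal sketch under stated assumptions: modifications are encoded as Python tuples, dates are integers with "precedes" taken as <, every assignment lasts one time unit, and failure or abortion results are passed through unchanged by later modifications. None of these choices comes from the paper; the names State, evaluate and apply_modif are hypothetical.

```python
from dataclasses import dataclass

NOP = ("nop",)      # hypothetical encoding of the "nop" modification
ABORT = ("abort",)  # hypothetical encoding of the "abort" modification


@dataclass(frozen=True)
class State:
    memory: tuple          # sorted (variable, value) pairs, standing in for sigma
    date: int              # t; "precedes" is assumed to be < on integers
    modif: tuple = NOP     # nop means the state is perfect


def evaluate(expr, state):
    """eval: a constant, a variable name, or ('+', a, b) / ('<', a, b)."""
    if isinstance(expr, tuple):
        op, a, b = expr
        x, y = evaluate(a, state), evaluate(b, state)
        if x is None or y is None:
            return None                      # undefined operand
        return x + y if op == "+" else x < y
    if isinstance(expr, str):
        return dict(state.memory).get(expr)  # None models an undefined value
    return expr


def apply_modif(modif, state, duration=1):
    """apply : Modif x State -> State, for perfect states only."""
    if not isinstance(state, State):         # assumption: failure/abortion propagate
        return state
    kind = modif[0]
    if kind == "nop":
        return state
    if kind == "abort":
        return ("abortion", state.date)
    if kind == "assign":
        _, var, expr = modif
        mem = dict(state.memory)
        mem[var] = evaluate(expr, state)
        # the new date t' is chosen so that precedes(t, t') holds
        return State(tuple(sorted(mem.items())), state.date + duration)
    if kind == "seq":
        _, s1, s2 = modif
        return apply_modif(s2, apply_modif(s1, state))
    if kind == "if":
        _, b, s1, s2 = modif
        cond = evaluate(b, state)
        if cond is True:
            return apply_modif(s1, state)
        if cond is False:
            return apply_modif(s2, state)
        return ("failure", state.date)       # undefined boolean value
    raise ValueError(f"unknown modification: {modif!r}")
```

For instance, applying ("seq", ("assign", "y", ("+", "x", 2)), ("if", ("<", "x", "y"), ("assign", "x", "y"), NOP)) to a state where x = 1 at date 0 yields a perfect state at date 2 in which both x and y hold 3.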


(2) Suppose now that the given state of the
process P, at that time t, is an imperfect one.
In such a state, an elementary modification is
being executed.

This fact can be described elegantly by
introducing the operation

in : Modif x State -> State (*)

To that imperfect state no modification can be
applied (and no expression can be evaluated in
it) until its perfect state is known.
So, we have

apply (s1, in (s2, Σ_t)) =
    apply (s1, apply (s2, Σ_t))
eval (e, in (s1, Σ_t)) =
    eval (e, apply (s1, Σ_t))

where Σ_t is the perfect required state on
which the modification s2 was submitted. The
computation of the modification s2 halts at a
perfect state apply (s2, Σ_t), which we can
describe by an operation

statend : State -> State

with the equations

statend (apply (s1, Σ_t)) = apply (s1, Σ_t)
statend (in (s1, Σ_t)) = apply (s1, Σ_t)
3.4.3. The "binding-operator" equations 
3.4.3.1. Communication axiomatisation 


In the above section no equations were given to
define the "send" and the "receive" operations
(i.e. modifications recognized by the "iostm?"
hidden operation). This is due to the approach
adopted and to the fact that the communication
type [10] requested by the language is still
intentionally unspecified. So different CSPD
types can be taken into account, according to
the choice of the communication type (which may
be synchronous, asynchronous, deterministic,
nondeterministic and so on) ; and it seems
possible to give a communication axiomatisation
by means of the same convenient tool.

To be succinct, we treat below two
communication types :

- a deterministic and synchronised
communication as it occurs in [6] : the
rendez-vous protocol obliges a process reaching
a communication command (in other words, able
to communicate) to wait for the corresponding
complementary command in its interlocutor.

- a nondeterministic and asynchronous
communication, where the producer and the
consumer are totally independent.


Remarks :

(1) No equations can be given to axiomatize the
communication synchronised by a rendez-vous
protocol because it depends on the capabilities
of the processes.

(2) A communication of the second type can be
seen as a basic modification on predefined
identifiers attached to the processes to hold
messages.


(*) Notice that we have the following equality

in (s', <σ,t,s>) = if s = nop then <σ,t,s>
                   else in (s', statend (<σ,t,s>)) fi


apply (P.Q!e, <σ_t(P),t,m>) = <σ',t',nop>
    let Σ = <σ_t(P),t,m> st precedes (t,t')
    and σ' = <subst (val (eval (π, Σ)),
                     eval (e, Σ))>
apply (P.Q?v, <σ_t(P),t,m>) =
    apply (v := π', <σ_t(P),t,m>)


where π and π' are those predefined
identifiers attached respectively to the states
of the processes P and Q.

(The type of π and π' depends on the
communication type. It can be a queue, an
infinite queue, a buffer with one element, and
so on.)


In the above equation, the identifier read by
the consumer is a copy of the one delivered by
the producer : that permits handling totally
independent production and consumption. We
assume this in order to be brief in describing
the concurrency axiomatisation.


3.4.3.2. Concurrency axiomatisation 


The computation of a system formed by the
communicating processes X and Y starts in an
initial general (and perfect) state
Σ_0 = <Σ_0(X), Σ_0(Y)> and may produce a set of
such couples as final states. We also include
as possible final states "failure_t" and
"abortion_t", which record respectively an
erroneous computation and an abortion in one
component.
This computation is described by
transformations of the general state. To tackle
the semantic equational definition of this
parallel computation a (binding) operation is
introduced (at the interpretation level of the
hierarchical construction of the CSPD type) :

exec : Progr x State -> State

Knowing at time t the general state Σ_t,
composed at t of Σ_t(X) and Σ_t(Y), the states
of the disjoint processes X and Y, we want to
describe formally

exec (X :: x1 ; x2 || Y :: y1 ; y2,
      <Σ_t(X), Σ_t(Y)>)

This formalization depends on the form of the
terms x_i and y_i (i = 1..2) and on Σ_t(X) and
Σ_t(Y), which can be perfect or not perfect.


Before tackling this formalization, we want to
point out the fact that it is a little
cumbersome. This is why we define it by
induction on the structure of modification
terms.


let Σ_t ∈ State ; d1,d2,d3,d4 ∈ Date
st Σ_t = <Σ_t(X), Σ_t(Y)>,
   d1 = datend (apply (x1, Σ_t(X)))
   d2 = datend (apply (y1, Σ_t(Y)))
   d3 = datend (Σ_t(X))
   d4 = datend (Σ_t(Y))
in


(1) Basic modification

bstm? (x1) ∧ bstm? (y1) =>
exec (X :: x1 ; x2 || Y :: y1 ; y2, Σ_t) =
exec (X :: x' || Y :: y', Σ')
st case
  perfect? (Σ_t(X)) ∧ perfect? (Σ_t(Y)) =>
    . x' = x2
    . y' = y2
    . if precedes (d1, d2) then Σ' =
          <apply (x1, Σ_t(X)), in (y1, Σ_t(Y))>
      elsif precedes (d2, d1) then Σ' =
          <in (x1, Σ_t(X)), apply (y1, Σ_t(Y))>
      else Σ' =
          <apply (x1, Σ_t(X)), apply (y1, Σ_t(Y))> fi ;

  perfect? (Σ_t(X)) ∧ ¬perfect? (Σ_t(Y)) =>
    . x' = x2
    . y' = y1 ; y2
    . if precedes (d1, d4) then Σ' =
          <apply (x1, Σ_t(X)), Σ_t(Y)>
      elsif precedes (d4, d1) then Σ' =
          <in (x1, Σ_t(X)), statend (Σ_t(Y))>
      else Σ' =
          <apply (x1, Σ_t(X)), statend (Σ_t(Y))> fi ;

  ¬perfect? (Σ_t(X)) ∧ perfect? (Σ_t(Y)) =>
    . x' = x1 ; x2
    . y' = y2
    . if precedes (d3, d2) then Σ' =
          <statend (Σ_t(X)), in (y1, Σ_t(Y))>
      elsif precedes (d2, d3) then Σ' =
          <Σ_t(X), apply (y1, Σ_t(Y))>
      else Σ' =
          <statend (Σ_t(X)), apply (y1, Σ_t(Y))> fi ;
esac ;
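The case analysis on end dates amounts to advancing the general state to whichever pending modification finishes first, leaving the other one in progress. The sketch below illustrates only that scheduling idea, not the full exec operation. It assumes integer dates, a fixed known duration for every basic modification, and two processes given as lists of durations; the function name interleave is hypothetical.

```python
def interleave(x_durations, y_durations):
    """Return completion events (date, process, index) in date order.

    Each process is a list of durations of its basic modifications.
    At every step the modification with the earlier end date completes;
    with equal end dates both complete in the same step.
    """
    events = []
    tx = ty = 0          # dates at which X and Y last reached a perfect state
    i = j = 0            # index of the next modification of X and of Y
    while i < len(x_durations) or j < len(y_durations):
        dx = tx + x_durations[i] if i < len(x_durations) else None
        dy = ty + y_durations[j] if j < len(y_durations) else None
        if dy is None or (dx is not None and dx < dy):
            events.append((dx, "X", i))      # X completes, Y stays in progress
            tx, i = dx, i + 1
        elif dx is None or dy < dx:
            events.append((dy, "Y", j))      # Y completes, X stays in progress
            ty, j = dy, j + 1
        else:                                # equal end dates
            events.append((dx, "X", i))
            events.append((dy, "Y", j))
            tx, ty, i, j = dx, dy, i + 1, j + 1
    return events
```

With X given as [2, 3] and Y as [4, 1], X completes its first modification at date 2, Y completes its first at date 4, and both second modifications complete together at date 5.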


(2) Communication modification


(2.1) Case of deterministic and synchronous
communication

iostm? (x1) ∧ bstm? (y1) =>
exec (X :: x1 ; x2 || Y :: y1 ; y2, Σ_t) =
exec (X :: x1 ; x2 || Y :: y', Σ')
st case
  perfect? (Σ_t(X)) ∧ perfect? (Σ_t(Y)) =>
    . y' = y2
    . Σ' = <Σ_t(X), apply (y1, Σ_t(Y))> ;

  perfect? (Σ_t(X)) ∧ ¬perfect? (Σ_t(Y)) =>
    . y' = y1 ; y2
    . Σ' = <Σ_t(X), statend (Σ_t(Y))> ;

  ¬perfect? (Σ_t(X)) ∧ perfect? (Σ_t(Y)) =>
    . y' = y2
    . if precedes (d3, d2) then Σ' =
          <statend (Σ_t(X)), in (y1, Σ_t(Y))>
      elsif precedes (d2, d3) then Σ' =
          <Σ_t(X), apply (y1, Σ_t(Y))>
      else Σ' =
          <statend (Σ_t(X)), apply (y1, Σ_t(Y))> fi ;
esac ;


% bstm? (x1) ∧ iostm? (y1) => similar to the
preceding equation, due to the commutativity
of the "parall" operation %
iostm? (x1) ∧ iostm? (y1) =>
exec (X :: x1 ; x2 || Y :: y1 ; y2,
      <Σ_t(X), Σ_t(Y)>) =
if perfect? (Σ_t(X)) ∧ perfect? (Σ_t(Y)) then
    if x1 = X.Y!e ∧ y1 = Y.X?v
    then exec (X :: x2 || Y :: y2,
               <Σ_t(X), apply (v := e, Σ_t(Y))>)
    elsif x1 = X.Y?v ∧ y1 = Y.X!e
    then exec (X :: x2 || Y :: y2,
               <apply (v := e, Σ_t(X)), Σ_t(Y)>)
    elsif x1 = X.X!e ∨ x1 = X.X?v ∨
          y1 = Y.Y!e ∨ y1 = Y.Y?v
    then failure_t
    else abortion_t
    fi
else if perfect? (Σ_t(X))
    then exec (X :: x1 ; x2 || Y :: y1 ; y2,
               <Σ_t(X), statend (Σ_t(Y))>)
    else % perfect? (Σ_t(Y)) %
         exec (X :: x1 ; x2 || Y :: y1 ; y2,
               <statend (Σ_t(X)), Σ_t(Y)>)
    fi
fi
(2.2) Case of nondeterministic and asynchro- 
nous communication 
iostm? (a4 bstm? (y,) => 
Xo hy 


of x, = X.X!ey x, = 


exec (X?: x43 2 Y43 Yoo z) = 


X.X?v then failure, 


else exec (X:: x2 || Yo: y gh et 


st case perfect? (oS (X)) => 


a x, = X.Yle then x =e €; X») 
else 2 x, = X.Y?2v 2 we =(v r= 7 +: x,) 
1 go 
ts a 
ros. 
ee 


perfect? (2 CY) => 


oS ee 


1 
-Y¥ = 


. tf precedes (d3,d2) then = 
(2 CX), in (y,> z(Y))> 
elstf precedes (d2,d3) then aes <r OX), 
apply (Cy, 2, (¥))> 
else zt =<statend (z,.09), apply (y,> 


ZY) )> fe & 


2 


<statend 


e&aC § 


% bstm ? (x,) Niostm? (y,) => similar to the 
preceding equation Z 


iostm? (x1) “~ iostm? (y,) => 
je & || Yor: 
if x, = X.Xte, V x, = 
VY, = Y.Yiv, 


e7.ge 


¥, 3 Yoo 2) = 


X.X?viv Y, = Y.Y!e 


exec (X i: x 


2 


then failure, 


exec (X ::23 oye DEY 3 Vos Z fe 


st vase perfect? (2, (X)) A perfect? (2 OY)) => 
ea x, = X.Y!e, then x = (T = e,) else 
= ? — = 7 : 
vi x X.¥?v, Tena (vy nee 
. = ! = = 
eee yy Y.Xte, then y (n, : e, else 


ny 4 = Y.X?v., AY\¥y = (v, r= 7 =) ft 


perfect? (2 CO)" | perfect? (2 CY)) => 


e. x, = X.Y!e then x= i ae e,else 
= ? _ -= 7 . 
yA xy X.Ytviz ev, : i EL 
Y= Yy 
‘| perfect? (2 (X)) WN perfect? (2 (Y) 
c= x, 
, = ! = ‘= 
; tf y, Y.X!le, then y aS - e,) else 
= ? = = ; 
vi yy Y.X2vi% y (v., : T Ft 


esac ; 
(3) Conditional modification


cstm? (x1) ∧ perfect? (Σ_t(X)) =>
exec (X :: x1 ; x2 || Y :: y1 ; y2,
      <Σ_t(X), Σ_t(Y)>) =
let x1 = if b then s1 else s2 fi
in
if eval (b, Σ_t(X)) = true
then exec (X :: s1 ; x2 || Y :: y1 ; y2,
           <Σ_t(X), Σ_t(Y)>)
elsif eval (b, Σ_t(X)) = false
then exec (X :: s2 ; x2 || Y :: y1 ; y2,
           <Σ_t(X), Σ_t(Y)>)
else failure_t
fi


% cstm? (y1) ∧ perfect? (Σ_t(Y)) => similar to
the preceding one.
cstm? (x1) ∧ ¬perfect? (Σ_t(X)) => not given
here to shorten the paper.
This equation depends on the form of y1.
cstm? (y1) ∧ ¬perfect? (Σ_t(Y)) => similar to
the preceding one %


3.4.3.3. General remarks : 


(1) The formalization involves three kinds of
equations (analogous ones are also written),
depending on the form of the modification
considered.


(2) The evaluation of a boolean expression
results in some cases in undefined values. This
is necessary for describing loops within a
single process.


(3) Successful termination of the system
formed by two processes is denoted by a couple
of complete states different from "failure"
and "abortion".


(4) Deadlock arises when the processes are
involved in some cyclic deterministic and
synchronous communication.


(5) The definitions above presuppose that exec
will never be supplied with a pair of states
<Σ_t(X), Σ_t(Y)> both of which are imperfect.
The partial function can easily be extended to
cover the missing case


¬perfect? (Σ_t(X)) ∧ ¬perfect? (Σ_t(Y)) =>
exec (X :: x || Y :: y, <Σ_t(X), Σ_t(Y)>) =
if precedes (datend (Σ_t(X)), datend (Σ_t(Y)))
then exec (X :: x || Y :: y,
           <statend (Σ_t(X)), Σ_t(Y)>)
else exec (X :: x || Y :: y,
           <Σ_t(X), statend (Σ_t(Y))>) fi


This would hold for all forms of x and y.


4. CONCLUSION :


When moving to concurrency and communication,
some inherent computational complexity arises
and makes it necessary to introduce a notion
of time.


Commonly this notion is considered as an
implementation tool ; we regard it as a notion
of a high abstraction level in specifying
parallel programs and formalizing their
semantics (similarly to the sequence notion
introduced to formalize the semantics of
sequential programming).


When defining the meaning of systems of
communicating processes in a "mathematical"
way, the main problem is to find appropriate
domains for the semantic models.


To avoid these crucial problems, purely
algebraic methods are used. The presented
abstract type takes into account together the
phrases of the language and the elements that
allow their meaning to be expressed, like
states and the "eval", "apply" and "exec"
operations.


Note that although the type is relatively
complex, the description of phrases is simpler
(language level of the hierarchical
construction, the other levels being used to
write axioms). In addition, different
communication types can easily be handled in
an elegant and simple manner.


Extending the method to treat more complex
constructs, especially those inducing
nondeterminism, and showing its complementarity
with axiomatic methods (proof rules), remains
to be developed.
ABSTRACT


The Poker Parallel Programming Environment
is a graphics-based, interactive system for
programming the Configurable, Highly Parallel
(CHiP) Computer. Designed to support nearly
all aspects of parallel programming in one
integrated system, Poker has been implemented
as a (35,000 line) C program on the VAX
11/780 under UNIX. It provides a number of
novel features including graphics programming
of parallel processor communication.


Although much sequential programming can be 
accomplished with only the support of a programming 
language compiler, loader and run-time system, parallel 
programming is too complex to be done with such
rudimentary facilities alone. The Poker System is an
interactive programming environment to support the
Configurable, Highly Parallel (CHiP) Computer [1]. The
Poker System is not itself a parallel program, but rather
it is a "front end", implemented on the VAX 11/780 under
UNIX. It is a front end to a preprototype version of the
CHiP hardware, called the Pringle, which is a 64 proces- 
sor parallel computer emulating the CHiP [2]. In addi- 
tion, Poker is a front end for a complete software emu- 
lator for the Pringle. 


The Poker System enables the programmer to 
define a large family of CHiP architectures with 4 to 
4096 processors. Programs can be written and emulated 
for any family member. Facilities are provided to define 
processor interconnection structures graphically, to pro- 
gram the processing elements, to compile, coordinate [3], 
and load, to perform single and multistep execution, and 
to peek and poke (whence the name) at the memory of 
the architectures. 


CHiP Programming is Something Else 


The programming environment provided by Poker 
is somewhat unconventional due partly to novel proper- 
ties of the CHiP Computer and partly to novel proper- 
ties of the system itself. To increase the accessibility of 
subsequent sections, we discuss here the activity of CHiP 
programming and the role Poker plays. 


Programming, of course, is the conversion of an 
(abstract) algorithm that is “machine independent” into a 
form suitable for execution on a particular computer. 
Thus, to begin programming a CHiP machine, we need 
to have a parallel algorithm in mind. The algorithm is 
presumed to have the form of a graph whose vertices are 
processes and whose edges specify the communication 
paths among the processes. 


This work is part of the Blue CHiP Project. It is supported in 
part by the Office of Naval Research Contracts N00014-80-K- 
0816 and N00014-81-K-0360. The latter is Task SRO-100. 


0190-3918/83/0000/0289$01.00 © 1983 IEEE 
For example, Figure 1 gives an algorithm that uses a
binary tree as the communication graph. The algorithm
finds the maximum of a set of numbers (stored one per
process in a local variable called "val") and then
multiplies each number by the maximum. The maximum is
found by "floating" the largest value in each subtree to
the root of that subtree. Then the global maximum is
broadcast back through the tree, where each process
multiplies it by its local "val". Notice that although
there are fifteen processes in the tree, there are only
three types of processes used.
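The net effect of this algorithm can be checked with a sequential sketch. The code below is not Poker or XX code; it assumes the tree is stored in heap order (node i has children 2i+1 and 2i+2), and the name tree_max_scale is hypothetical. The recursion plays the role of "floating" the maximum up each subtree, and the final list comprehension plays the role of the broadcast followed by the local multiplication.

```python
def tree_max_scale(vals):
    """Scale every node value by the global maximum of a binary tree.

    vals holds one value per process in heap order: the root is vals[0]
    and node i has children 2*i + 1 and 2*i + 2 when they exist.
    """
    n = len(vals)

    def float_up(i):
        # Largest value in the subtree rooted at node i.
        best = vals[i]
        for child in (2 * i + 1, 2 * i + 2):
            if child < n:
                best = max(best, float_up(child))
        return best

    global_max = float_up(0)                 # value that reaches the root
    return [v * global_max for v in vals]    # broadcast, then local multiply
```

For a seven-node tree holding 3, 1, 4, 1, 5, 9, 2 the maximum 9 reaches the root and every value is multiplied by 9.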

The conversion of this algorithm to run on a CHiP
computer, i.e., the programming, is straightforward. It
involves


(a) embedding the communication graph into the 
switch lattice, 

(b) programming the process types in a sequential 
programming language, 

(c) assigning one of the process types to each pro- 
cessor, 

(d) naming the data path ports, and 

(e) compiling, assembling, coordinating, and loading
the program.


We consider each of these activities in turn. 


Embedding the communication graph into the 
switch lattice requires that we program the switches of 
the lattice so that the processors have a topology that 
matches (or is a super set of) the topology of the com- 
munication graph. This embedding operation is done 
graphically (rather than symbolically) in the Poker 
System using the Switch Settings mode. Figure 2 illus- 
trates a particular embedding of the fifteen node binary 
tree into the lattice. Processor (1,2) is the root of the 
processor tree, processor (1,1) is a leaf, and processor 
(1,3) is unused. 

Next we program the three process types in a
sequential language, XX. Each process is viewed as a
procedure with (optional) parameters and local
variables. In addition to the usual declarations we must
specify the port names, symbolic names used by a process
to refer to other processes with which it communicates.
Figure 3 shows the XX code for the three process types.
In the programs the symbol '<-' is used for
input/output; assigning to a port name, e.g., PARENT
<- val, causes output, and assigning from a port name,
e.g., max <- PARENT, causes input.


The construction of the processor tree in the switch
lattice to match the communications graph gives an
implicit association between the processes of the
algorithm and the processors. We make this relationship
explicit by assigning process names to the appropriate
processors using the Code Names mode of the Poker
System. Figure 4 gives the result.


Next, the port names mentioned in each process 
must be associated with a specific data path. Each pro- 
cessor has eight ports corresponding to the compass 
points. Only those connected by an active data path to 
another PE need be named. This activity is performed 
using the Port Names mode of Poker. Figure 5 shows 
the result of naming the ports. 


The algorithm is now programmed. Next, each pro- 
cess type mentioned in the Code Names specification is 
compiled into assembly code. The assembly code is then 
“coordinated,” i.e., modified so that the CHiP Computer 
can run it synchronously. The coordinated programs are 
assembled to produce processor object code. The inter- 
connection structure is “compiled” to produce switch 
object code. The object codes are loaded into the 
machine and executed. 


Description of the Poker Environment 


In the last section we used the Poker Programming 
Environment to embed graphs, to define processes, to 
assign processes to processors, and to declare port 
names. The discussion implied the existence of certain 
facilities in the Poker System. Now we give a more
complete description of those facilities.


The Poker System is an interactive programming 
environment that uses two displays: The primary display 
is a high resolution (1024 x 768 pixel) bit-mapped 
display, and the secondary display is a conventional 
character display. The two displays are used to increase 
the amount of information available to the programmer. 
Most activity takes place on the primary display; XX 
programming is usually done on the secondary display. 


The primary display has the form illustrated in Fig- 
ures 2-5. The bottom square region, called the field, is 
where most of the programming activity takes place. 
The field always displays some schematic representation 
of the two dimensional array of processors being pro- 
grammed. The exact form of the representation changes 
depending on whether the programmer is performing a 
graph embedding, a process assignment, a port declara- 
tion, etc. Since the field is not always large enough to 
show the whole schematic representation, a map of that 
portion being displayed is given in the upper left-hand 
corner. Status information, diagnostics and miscellane- 
ous data are given in the upper right region of the 
display, called the chalkboard. The bottom line of the 
chalkboard is the command line, used for specifying the 
few textual commands* required by Poker, such as
reading library files.


The logical structure of the Poker Environment is 
shown in Figure 6. It provides an integrated set of facil- 
ities to 


• define architecture characteristics (CHiP
  Parameters),

• embed communication graphs (Switch Settings),

• program process code segments (XX language),

• assign processes to processors (Code Naming),

• declare port names (Port Naming),

• compile, coordinate, assemble and load
  (Command Request),

• execute, trace, peek and poke (Trace Values).


*Poker is completely interactive; most actions are given as a single
key stroke and have immediate effect.


We now describe each of these facilities in detail. 


Architectural definition. Because Poker is intended to be 
a laboratory tool for studying CHiP programming, it has 
been designed to support a number of CHiP family 
architectures. Programs can be written for logical CHiP 
machines with from 4 to 4096 processors. All of these 
logical machines can be emulated using a software emu- 
lator, and one family member, the 64 processor version, 
will be able to be run on a hardware emulator, the Prin- 
gle [2], when it is completed. Consequently, the pro- 
grammer begins using Poker by specifying the charac- 
teristics of the underlying logical architecture. These 
include the number of processing elements and the 
amount of routing capability needed for the lattice (cor- 
ridor width [1]). The default parameters are those that 
match the machine defined in the previous session, or, if 
there was none, then the parameters of the Pringle 
hardware. 


Graph embedding. The field of the primary display 
shows the lattice of the current architecture, as
illustrated in Figure 2. The activity is largely that previously
described; the programmer connects the processors 
(represented as boxes) with line segments to define 
edges. Graphics primitives based on cursor keys permit 
edges to be drawn and erased. Facilities are available 
for following graph edges, managing the display (e.g., 
centering), saving embeddings, reading in library embed- 
dings, etc. 


Programming the process code segment. The XX 
(Dos Equis) sequential programming language is a sim- 
ple scalar language for defining processes. The language 
has four data types (Boolean, character, integer and 
real), the common control structures (while, for, if- 
then-else, etc.), vectors and the usual supply of scalar 
arithmetic and logical operators. In addition to data 
type declarations, one can also declare scalar variables to 
be port names, procedure parameters, or variables to be 
traced. Input/output is performed by assigning from or 
to a port name. The semantics are “data-driven:” writes 
occur immediately and reads wait on the arrival of data, 
if necessary. XX process codes are generally developed 
on the secondary display using a standard editor. 
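The "data-driven" port semantics (writes occur immediately, reads wait for data) can be imitated with unbounded queues and threads. This is only an illustrative sketch, not XX or Poker code: each direction of a data path is assumed to be one queue.Queue, and the function names leaf, root and run are hypothetical. It wires a root to two leaves and performs the max-and-multiply exchange of the running example.

```python
import queue
import threading


def leaf(name, val, up, down, results):
    up.put(val)                         # PARENT <- val : the write returns at once
    results[name] = val * down.get()    # max <- PARENT : the read blocks for data


def root(val, left_up, right_up, left_down, right_down, results):
    x = left_up.get()                   # x <- LCHILD
    y = right_up.get()                  # y <- RCHILD
    largest = max(x, y, val)
    left_down.put(largest)              # LCHILD <- max
    right_down.put(largest)             # RCHILD <- max
    results["root"] = val * largest


def run(root_val, left_val, right_val):
    """Run one root and two leaves concurrently; return the final values."""
    results = {}
    left_up, right_up, left_down, right_down = (queue.Queue() for _ in range(4))
    threads = [
        threading.Thread(target=leaf,
                         args=("left", left_val, left_up, left_down, results)),
        threading.Thread(target=leaf,
                         args=("right", right_val, right_up, right_down, results)),
        threading.Thread(target=root,
                         args=(root_val, left_up, right_up,
                               left_down, right_down, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the reads block, the result does not depend on the order in which the threads are scheduled: run(2, 5, 3) gives 10 for the root, 25 for the left leaf and 15 for the right leaf.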


Process assignment. The processors are assigned
processes using a field display on the primary terminal
like that in Figure 4. The programmer enters the name
of the process procedure on the first line of the
processor box. If the procedure has formal parameters, then
values for the actual parameters can be entered on the
following (four) lines. Facilities are provided for
buffering the contents of a box and then automatically
depositing the contents of the buffer into processors in
whole regions of the processor array. In this way the
programmer is saved from manually entering repeated
information when the algorithm exhibits uniformity.


Port declarations. The field of the primary display
has the form illustrated in Figure 5. Each processor has
up to eight incident edges as a result of the graph
embedding, and it has been assigned a process which
refers to up to eight port names. These are matched
using the port declaration. The processor box is divided
into eight windows:


    nw port    n port    ne port
    w port     home      e port
    sw port    s port    se port

The programmer enters the names used by the assigned 
process code into the window for that edge. The names 
are clipped to the first five characters. Facilities are pro- 
vided for displaying unclipped names in the chalkboard, 
and like the process assignment, it is possible to buffer 
port assignments and deposit them automatically in 
whole regions of the processor array. 


Program translation. The preceding facilities
provide a means of specifying the elements of a Poker
program. They are then converted into executable form.
The XX compiler converts each process to assembly
code. The coordinator [3] then attempts to convert the
process assigned to each processor into a form that
permits the entire program to run with synchronous (i.e.,
not data-driven) execution. [This step can be by-passed
and the processes can be run in data-driven form.] If
coordination is successful, the processors may all have
different assembly codes associated with them. In any
event the assembler converts the assembly code to object
form. The connector "compiles" the graphical
representation of the communication graph into an object form.
The object code and the object graph as well as the
actual parameter values are loaded into the emulator (or
the Pringle).


Execution. The resulting program is executed and 
the traced variables are displayed; the field is similar to 
that used for process assignment. The execution can 
proceed for a given number of steps, or until a displayed 
value changes. When the execution is suspended, any of 
the displayed values can be changed. When execution 
resumes these new values are poked back into the pro- 
cessor memory. 


Further detail about the Poker Environment can be 
found in the references [4,5]. 
leaf process:
    write val to parent;
    read max from parent;
    val ← val · max;

ancestor process:
    read x from left child;
    read y from right child;
    write max (x,y,val) to parent;
    read max from parent;
    write max to left child;
    write max to right child;
    val ← val · max;

root process:
    read x from left child;
    read y from right child;
    max ← max (x,y,val);
    write max to left child;
    write max to right child;
    val ← val · max;


Figure 1. An algorithm; each leaf is an instance of
the leaf process, the root is an instance of
the root process and all other nodes are
instances of the ancestor process.
Wed May 18 10:34 MODE: Switch Setting 
PHASE: 1 LAST PE: 3 4 SAVED PE: NONE 
Figure 2. An embedding of the 15 node binary tree.


code leaf (val);
ports PARENT;
begin
    int max, PARENT;
    PARENT <- val;
    max <- PARENT;
    val:=val * max;
end.


code root (val);
ports LCHILD, RCHILD;
begin
    int x,y, max, val,
        LCHILD, RCHILD;
    x <- LCHILD;
    y <- RCHILD;
    if x > y then max:=x
    else max:=y;
    if val > max then max:=val;
    LCHILD <- max;
    RCHILD <- max;
    val:=val * max;
end.


code ancestor (val);
ports PARENT, LCHILD, RCHILD;
begin
    int x,y, max, val,
        PARENT, LCHILD, RCHILD;
    x <- LCHILD;
    y <- RCHILD;
    if x > y then max:=x
    else max:=y;
    if val > max then max:=val;
    PARENT <- max;
    max <- PARENT;
    LCHILD <- max;
    RCHILD <- max;
    val:=val * max;
end.


Figure 3. Code for the three process types. 


Wed May 18 10:38 MODE: Code Names - interconnect
PHASE: 1 LAST PE: 3 3 SAVED PE: NONE


Figure 4. Assignment of process names to processors;
note that the name "ancestor" has been
clipped to five characters.
Wed May 18 10:42 MODE: Port Names - interconnect
PHASE: 1 LAST PE: 1 4 SAVED PE: NONE


Figure 5. The specification of the port names; note
that the names have been clipped to the
first five characters.
[Diagram boxes: Mode Select; CHiP Parameters; Switch Settings;
Process Set Def.: XX Programming Lang.; Process Assignment: Code Names;
Port Declarations: Port Names; Program Translation: Command Request;
XX Compiler; Assembler; Connector; Loader; Pringle; Emulator;
Execution: Trace Values]


Figure 6. The logical structure of the Poker Environment.


A High Level Analysis Tool 
For Concurrent Programs 


Paolo Mancarella and Franco Turini 


Dipartimento di Informatica, 


C.so Italia, 40,


ABSTRACT
The design of a tool which analyzes Concurrent
Sequential Processes-like programs is presented.
The purpose of the analyzer is to favor the
understanding of a concurrent program via its
simplification with respect to constraints
interactively provided by the user. Constraints
can be either on the input data or on possible
synchronizations. The analyzer is based on
parallelism removal techniques and symbolic
evaluation techniques.
1. INTRODUCTION

Much research effort has been recently
devoted to the design of languages for parallel
processing and to the study of their formal
semantics /5,6,11/. Several basic principles,
like message passing and separation of
workspaces, are now widely accepted. It is then
time to begin the design of suitable programming
environments for such a class of languages.
Among the tools one would like to have there are
the so called static analyzers, i.e. programs
able to favor the discovery of properties of the
run-time behavior of parallel programs.

The paper reports on the design of a tool for 
analyzing concurrent programs written in a CSP- 
like language. The basic idea is to apply the 
technique of symbolic evaluation, which has been 
successfully applied to sequential programs, to 
parallel ones. The process of symbolically 
evaluating a program is based on the ability of 
executing it on partially specified input data. 
Such an ability can be exploited to several 
purposes:


(1) To obtain symbolic statements, i.e.
logical assertions involving the variables
of the program, about the run-time
behavior /1,3/.


This work was partly supported by CNR-PFI under 
contract No. 81.02053.97. 


0190-3918/83/0000/0293$01.00 © 1983 IEEE 
(2) To prove the correctness of a program,
when it is annotated by suitable
assertions /7/.


(3) To partially compile a program in order
either to obtain a more efficient one on a
subset of possible input data or to get a
better understanding of the behavior of
the program /4,13/.


Our approach intends to exploit symbolic
evaluation techniques to obtain a better
understanding of the behavior of parallel
programs under explicit hypotheses on the
behavior of their environment. Such hypotheses
include restrictions on inputs received from
outside.
Basically, the system accepts a parallel
program and transforms it in two respects:

- It tries to resolve part of the
parallelism by substituting it with
sequential nondeterministic code.

- It annotates control points of the
resulting program with assertions about
the state at those points.


The transformations are interactively
guided by the user, who, acting as the external
environment, can force the behavior of the
program either by providing restrictions on the
input data or by forcing certain process
synchronizations instead of others.
The attempt to eliminate parallelism is
motivated by the desire to obtain a better
understanding of the program. In fact, on one
hand we expect that sequential reasoning is
easier than parallel reasoning; on the other
hand putting together two parallel modules
allows symbolic information to be moved from one
process to the other, possibly leading to the
simplification of both. The design of
parallelism removal has been deeply influenced
by several approaches to the semantics of CSP-
like languages. Indeed, several authors /5,11/
propose to reduce the notion of parallelism to
the notion of sequentiality plus nondeterminism,
ultimately denoting a parallel program by a
multi-valued function.
Our analysis tool is organized in three
steps:

- First of all, a process is analyzed in
isolation in order to annotate it with
assertions about its local symbolic
states.

- Second, processes are iteratively
matched two by two in order to eliminate
parallelism.

- Finally, program transformation techniques
are applied to the resulting program.

The paper is organized as follows: section 2 describes the parallel
language we are dealing with. Section 3 describes the tool in some
detail, while section 4 contains an example which puts in focus the
characteristics and the possible use of the tool.


2. THE LANGUAGE CSP-L 


The language is a slight modification of Communicating Sequential
Processes /6/. The differences do not concern the basic features of
the language, i.e. message passing, nondeterminism and static
configuration of the program.

The syntax of CSP-L is described in a BNF-like form, where {x}* denotes
'n' occurrences of 'x', n ≥ 0, and {x} denotes at most one occurrence
of x.


    <system>            ::= <process> { || <process> }*
    <process>           ::= <name> :: [ <command> ]
    <command>           ::= skip | <assignment> | <input> |
                            <output> | <composite_command> |
                            <alternation> | <repetition>
    <assignment>        ::= <variable> := <expression>
    <input>             ::= <channel_id> ? <variable>
    <output>            ::= <channel_id> ! <expression>
    <composite_command> ::= <command> ; <command>
    <alternation>       ::= [ <guarded_command>
                              { [] <guarded_command> }* ]
    <guarded_command>   ::= <guard> → <command>
    <guard>             ::= <boolean_expression>
    <repetition>        ::= *[ <do_guard_command>
                               { [] <do_guard_command> }* ]
    <do_guard_command>  ::= <guard> → <do_body>
    <do_body>           ::= <command> | { <command> ; } EXIT


The most important differences between CSP and CSP-L are the use of
channels for communication and the definition of alternative and
repetitive commands. In fact, the former does not deal with input
guards and the latter has an explicit exit condition. Some motivations
for similar modifications can be found in /12/.


3. THE ANALYSIS TOOL

The input of the analysis tool is the abstract syntax tree of the
program. Each step performs some kind of tree transformation. The
transformations come in two categories:

- annotating the arcs of the trees with assertions.

- transforming the trees by merging two of them together, by
  eliminating subtrees and so on.


Before tackling a description of the phases of the analyzer it is
necessary to spend some words about what we mean by symbolic constants
and symbolic states. By symbolic constant we mean a subset of a data
domain. Symbolic constants are represented by predicates. If

    p: D → Bool

is a predicate on the data domain D, it represents the symbolic
constant

    A = {x / p(x) = true}.

Applying a basic operation to a symbolic constant is then a predicate
transforming operation. For example, let

    A = {x / x > 10};

then "A + 1" must result into the predicate

    {x / x > 11}.

An important issue in designing a symbolic 
evaluator along these lines is then to provide a 
“predicate transformer semantics" to each 
primitive operation of the data types allowed by 
the language. This issue will not be considered 
further in this paper and it is dealt with in 
full in /13/. 
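
The idea of a symbolic constant as a predicate, and of "A + 1" as a predicate transformer, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `SymConst` and `plus` are hypothetical, the data domain is taken to be the integers, and the paper's own representation (described in /13/) may differ.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SymConst:
    """A symbolic constant: the set {x / pred(x) = true} over the integers."""
    pred: Callable[[int], bool]

    def contains(self, x: int) -> bool:
        return self.pred(x)


def plus(a: SymConst, k: int) -> SymConst:
    """Predicate transformer for 'A + k': y belongs to A + k iff y - k belongs to A."""
    return SymConst(lambda y: a.pred(y - k))


# The example from the text: A = {x / x > 10}, so A + 1 = {x / x > 11}.
A = SymConst(lambda x: x > 10)
A_PLUS_1 = plus(A, 1)
```

Here `A_PLUS_1.contains(11)` is false while `A_PLUS_1.contains(12)` is true, matching the predicate x > 11.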

A symbolic state is a set of bindings among 
identifiers and symbolic constants. 

From now on by symbolically evaluating a 
(sequential) program we intend the ability of 
computing the symbolic state associated to every 
control point of the program by propagating the 
initial symbolic constants through the control 
paths. 

The three phases of the tool are named A1, A2, A3 respectively.


3.1. First Phase : A1

The purpose of A1 is to simplify a tree eliminating assignments and
sequentialization nodes. Assignment nodes are eliminated by annotating
the arcs of the tree with symbolic states which keep track of their
effects. Sequentialization nodes are eliminated by embedding the
control flow into the structure of the tree.
For example the fragment of process

    ... ; c ? x ; x := f(x) ; y := g(x) ; d ! y ; ...

where "c" and "d" are channel names and "f" and "g" are elementary
operations, is transformed by A1 into the subtree:

    fig. 1

where "f_s" and "g_s" are the symbolic constants resulting from the
computation of "f" and "g" respectively.
The elimination of assignment nodes is particularly important because
it reduces the necessity of interleaving the independent actions of two
processes during their merging.
During this phase the symbolic evaluation of the process is performed
locally. This is to say that A1 is unable to propagate symbolic
constants beyond certain points of control because of the lack of
information. First of all, input nodes are considered by A1 points
after which the evaluation is restarted with an undefined symbolic
state (i.e. the symbolic value associated to each variable of the
program is a predicate representing the whole data domain). A1 just
assumes that the communication will effectively take place, but it has
no information about the value assigned to the input variable by the
communication.

The choice of making the whole state of the computation undefined may
appear a bit too extreme. Indeed A1 could propagate the part of the
symbolic state which is not affected by the communication. On the other
hand, A3 will recompute symbolically the whole program to take
advantage of the information gained by A2, hence the full symbolic
evaluation is deferred to A3.

Secondly, loops in the process induce another class of cut-points for
obvious reasons. The subtree representing a loop, produced by A1, is of
the kind:

    fig. 2

where "s" is the symbolic state valid before entering the loop and "⊥"
is the undefined symbolic state.
It is worth noting that at this point it does not make any sense trying
to apply techniques for synthesizing loop invariants, since the control
structure of the program obtained by merging several processes may be
radically different.

In summary the tree resulting from the application of A1 has paths of
the kind:

    fig. 3

where the symbolic states "si" are independent. For example "s2"
retains the effect of the possible assignments between "N1" and "N2"
but no information derived from "s1".
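
The local evaluation performed by A1 on a straight-line fragment can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the tuple encoding of commands, the helper names `substitute` and `a1_local`, and the use of an empty dictionary for the undefined state are all assumptions made for the example. Assignments disappear into the state annotating the next arc, and every input node acts as a cut-point after which the state is reset.

```python
def substitute(expr, state):
    """Replace variables in expr by their symbolic values, when known.

    Expressions are variable names (str) or tuples (operator, arg1, ...).
    """
    if isinstance(expr, str):
        return state.get(expr, expr)
    op, *args = expr
    return (op, *[substitute(a, state) for a in args])


def a1_local(commands):
    """Return [(node, state)] where state annotates the arc entering node.

    An empty dict stands for the undefined symbolic state.
    """
    state = {}
    path = []
    for cmd in commands:
        kind = cmd[0]
        if kind == "assign":
            _, var, expr = cmd
            # The assignment node is eliminated; its effect is kept in the state.
            state = {**state, var: substitute(expr, state)}
        elif kind == "input":
            path.append((cmd, state))
            state = {}  # cut-point: nothing is known after the communication
        else:  # "output"
            path.append((cmd, state))
    return path


# Hypothetical fragment: c ? x ; x := f(x) ; y := g(x) ; d ! y
FRAGMENT = [
    ("input", "c", "x"),
    ("assign", "x", ("f", "x")),
    ("assign", "y", ("g", "x")),
    ("output", "d", "y"),
]
ANNOTATED = a1_local(FRAGMENT)
```

Only the two I/O nodes survive in `ANNOTATED`; the arc entering the output node carries the bindings produced by the two eliminated assignments.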


3.2. Second phase : A2

A2 is designed to merge together two symbolic trees, yielding a new
one: the idea is to apply A2, step by step, to pairs of symbolic trees
constructed by A1 or by previous applications of A2 itself, in order to
obtain a symbolic tree describing the control structure of the whole
program. Given two symbolic trees T1 and T2, the result of evaluating
A2[T1;T2] is a symbolic tree T embedding the control structure of the
compound process P1||P2.

The purpose of A2 is to solve the internal communications of two
processes, keeping track of them in the symbolic states of the new
tree. In some sense A2 simulates the parallel execution of the two
processes in that it maintains and updates two continuation points in
the two processes and produces new nodes of the final tree, depending
upon the currently examined nodes of the input trees.

A2 is recursively defined by cases and we show here the most important
ones. As before, a graphical representation of trees is used.


Let T1 and T2 be trees like:

    fig. 4 (T1 with initial node N1 and remaining subtree T1';
            T2 with initial node N2 and remaining subtree T2')
- case i)
Let N1, N2 be two matching I/O nodes, i.e. they name the same internal
channel. Assume, for example, that N1 corresponds to the input command
"c?x" and N2 to the output command "c!exp". The result is simply to
modify the current symbolic state to take into account the "assignment"

    x := exp

and recursively apply A2 on T1' and T2' without generating any new node
for the resulting tree.
In all the other cases A2 does not modify symbolic states. In other
words, the symbolic evaluation is performed locally also by A2.
- case ii)
Let N1 be an internal I/O node (i.e. it names a channel common to the
two processes) and N2 an external I/O node (i.e. it names a channel
unknown to the other process). The only way the system of the two
processes can proceed is that the communication named in N2 occurs. As
a consequence A2 assumes the occurrence of such a communication,
re-applying to T1 and T2' and creating a copy of N2 for the resulting
tree. Graphically:

    fig. 5 (a copy of N2 followed by the subtree A2[T1; T2'])

- case iii)
Let N1, N2 be internal unmatching nodes. N1 and N2 may not match either
because the processes name different channels or because the
communication requests are not complementary. This can be classified as
a deadlock situation. The aim of our analyzer, at least in the present
design, is to point out the behavior of the program in the absence of
deadlock conditions. Hence A2 is designed to abort the execution path.
The reason for this design decision is that the deadlock situations
detected by A2 may be only apparent: such events will never occur in an
actual execution, but they are generated by the order of application of
A2. In other words, deadlock detection can be performed only by global
reasoning on the whole program. Indeed, such a design choice guarantees
the associativity of A2 /10/.


- case iv)
Let N1 and N2 be external I/O nodes (i.e. they name channels connecting
the two processes to other processes or to the external world). In this
situation no assumptions can be made about the order of the two
communications. Consequently A2 generates a nondeterministic branch in
the resulting tree as follows:

    fig. 6

where T' = A2[ T1' ; T2 ]
      T" = A2[ T1 ; T2' ].


- other cases 
The behavior of A2 in the other cases is much 
simpler and will be made evident by the example. 


- termination conditions
It is instead very important to discuss the termination of A2, i.e. the
reasons why A2 is able to yield a finite tree.
Obviously, A2 halts when it encounters the end of two paths of the
input trees, i.e. the termination of execution paths for the two
processes.
Furthermore, A2 reaches a termination point when it detects a deadlock
condition, as pointed out earlier.
Finally, A2 halts when it is considering two subtrees which have
already been merged. In this case a re-entering arc to the subtree
previously generated is added to the resulting tree, creating a cyclic
path (i.e. a loop). Note that the presence of such paths implies that
we are actually dealing with graphs.
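
The four cases of A2 can be sketched for the restricted situation of two straight-line paths, with no guards and no loops, so the re-entering arcs discussed above never arise. This is a simplified illustration and not the formal definition given in /10/: the `IO` record, the tuple encoding of the result, and the choice to carry the bindings of matched communications down to the leaves are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IO:
    chan: str   # channel name
    kind: str   # "in" for c?x, "out" for c!exp
    arg: str    # input variable or output expression


def a2(t1, t2, internal, state=()):
    """Merge two straight-line paths (tuples of IO nodes) into one tree."""
    if not t1 and not t2:
        return ("end", state)
    if not t1 or not t2:
        # One process has terminated: the other can proceed only externally.
        n, *rest = t1 or t2
        if n.chan in internal:
            return ("abort",)
        return ("node", n, a2(tuple(rest), (), internal, state))
    n1, n2 = t1[0], t2[0]
    i1, i2 = n1.chan in internal, n2.chan in internal
    if i1 and i2:
        if n1.chan == n2.chan and n1.kind != n2.kind:
            # case i: matching communication, recorded as "x := exp"
            rcv, snd = (n1, n2) if n1.kind == "in" else (n2, n1)
            return a2(t1[1:], t2[1:], internal, state + ((rcv.arg, snd.arg),))
        return ("abort",)  # case iii: apparent deadlock, abort the path
    if i1:   # case ii: only the external communication N2 can occur
        return ("node", n2, a2(t1, t2[1:], internal, state))
    if i2:   # case ii, symmetric
        return ("node", n1, a2(t1[1:], t2, internal, state))
    # case iv: both external, nondeterministic branch with T' and T"
    return ("choice",
            ("node", n1, a2(t1[1:], t2, internal, state)),
            ("node", n2, a2(t1, t2[1:], internal, state)))


R1, C_OUT = IO("r1", "in", "I1"), IO("c", "out", "min(I1)")
R2, C_IN = IO("r2", "in", "I2"), IO("c", "in", "y")
TREE = a2((R1, C_OUT), (R2, C_IN), internal={"c"})
```

In `TREE` the two external reads appear in both possible orders, the internal communication on "c" has disappeared, and each leaf records the binding of y to min(I1).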


3.3. Third phase : A3

When A2 has been iteratively run on pairs of processes to obtain a
single symbolic tree, A3 is run to simplify it.
A3 is the phase which embeds most of the power of symbolic evaluation.
Indeed it tries to propagate symbolic information through the whole
tree and, furthermore, it tries to find simple invariants of loops. The
symbolic information is also used to simplify the tree, dropping the
paths which can be taken by no actual execution.
The two tasks of A3, i.e. tree simplification and finding invariants,
reflect into two different modes of working: forcing and non-forcing
mode.


- FORCING MODE

When working in forcing mode, A3 propagates symbolic states and tries
to drop paths; in this case the application of A3 to a boolean node can
result into different actions:

i) The current symbolic state implies that the guard is always true.
   The path is simplified eliminating the node.

ii) The current symbolic state implies that the guard is always false.
    The path is eliminated from the tree.

iii) None of the above implications is true. The only consequence is
     that the symbolic state propagated by A3 incorporates the guard
     (see the example).
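
The three outcomes above can be sketched for the simplest kind of guard, a boolean variable or its negation. The function name `force_guard`, the outcome labels, and the dictionary encoding of the symbolic state are assumptions of this sketch; a real simplifier would have to decide implications between arbitrary predicates.

```python
def force_guard(state, var, negated=False):
    """Apply forcing mode to the guard 'var' (or '~var' when negated).

    state maps boolean variables to their known truth value; a variable
    absent from the state is unconstrained.
    Returns (action, new_state).
    """
    if var in state:
        holds = state[var] != negated
        if holds:
            return ("eliminate-node", state)   # i) guard always true
        return ("drop-path", None)             # ii) guard always false
    # iii) neither implication holds: the state incorporates the guard
    return ("keep-node", {**state, var: not negated})


S = {"guard": True, "test": True}
OUTCOME_P1 = force_guard(S, "guard")                 # guard 'guard'
OUTCOME_P2 = force_guard(S, "guard", negated=True)   # guard '~guard'
OUTCOME_UNKNOWN = force_guard({}, "guard", negated=True)
```

With the state `S`, the node for 'guard' is eliminated and the path for '~guard' is dropped; with an empty state the guard is simply recorded.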

The application of A3 to an input node results into establishing an
interaction with the user. The user is requested to simulate the
external environment and to provide an input to the system. The input
must be a symbolic constant which will be propagated by A3 through the
program. It is worth recalling that a concrete value is a special kind
of symbolic constant.


- NON FORCING MODE

When A3 encounters the initial node of a loop (which has been marked by
A2 on creating a re-entering arc), it switches to non-forcing mode in
the attempt of finding simple invariants for it. All the paths exiting
from the node are walked through, starting with the current symbolic
state and propagating states around the loop without affecting the tree
(i.e. without forcing boolean guards or eliminating paths).
Whenever a re-entering arc is found, A3 associates the current symbolic
state to the loop node. In this way, when all the cyclic paths have
been traversed the loop node maintains all the final states obtained by
symbolically executing the loop starting with the symbolic state
immediately preceding it (taken as an "attempt" state). At this point
A3 intersects these states in order to find possible invariants for the
loop. Finally A3 switches to "forcing" mode and restarts the analysis
from the loop node, taking as current symbolic state the invariant one.
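
The intersection step can be sketched as keeping only the bindings on which every state bound to the loop node agrees. This is a deliberately crude stand-in: the function name `intersect` is hypothetical, symbolic values are compared as strings, and the entries for I1 and I2 in the attempt state are placeholders chosen for the illustration. A real implementation would reason on the predicates themselves.

```python
def intersect(*states):
    """Keep the bindings that are identical in all the given states."""
    first, *rest = states
    return {var: val for var, val in first.items()
            if all(var in s and s[var] == val for s in rest)}


# Attempt state before the loop and state reached at the re-entering arc,
# loosely modelled on the example of section 4.
ATTEMPT = {"guard": True, "test": True,
           "y": "min(I1)", "x": "max(I2)",
           "I1": "I1", "I2": "I2"}
AFTER_ONE_TURN = {"guard": True, "test": True,
                  "y": "min(I1)", "x": "max(I2)",
                  "I1": "ins(x; del(I1; min(I1)))",
                  "I2": "ins(y; del(I2; max(I2)))"}
INVARIANT = intersect(ATTEMPT, AFTER_ONE_TURN)
```

The bindings for I1 and I2 change around the loop and are dropped, while those for guard, test, x and y survive as the candidate invariant.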


Some other details about A1, A2 and A3 and their formal description in
a LISP-like formalism can be found in /10/.


4. AN EXAMPLE 


The previous sections have somewhat vaguely described the design of the
analyzer. This section tries to complete the description showing how
the analyzer works on a concrete example. In this example the user is
not supposed to provide any constraint on the external environment.
However, the example is interesting because, also in this case, the
process provides a better insight into the run-time behavior of the
program.

Let P be the CSP-L program:

    P :: [ P1 || P2 ]

where P1 and P2 are described as follows:

    P1 :: [ guard := true ;
            r1 ? I1 ;
            *[ guard →
                 c ! min(I1) ;
                 d ? x ;
                 [ x < min(I1) → guard := false
                 [] x > min(I1) →
                      I1 := ins(x; del(I1; min(I1))) ]
             [] ~guard → EXIT ] ]

    P2 :: [ test := true ;
            r2 ? I2 ;
            *[ test →
                 c ? y ;
                 d ! max(I2) ;
                 [ y ≥ max(I2) → test := false
                 [] y < max(I2) →
                      I2 := ins(y; del(I2; max(I2))) ]
             [] ~test → EXIT ] ]


Intuitively the program P is designed to transform two sets of numbers,
say S1 and S2, into two new sets, say S1' and S2', such that:

    S1 ∪ S2 = S1' ∪ S2'

and

    ∀ x,y (x ∈ S1' and y ∈ S2') → x ≥ y
At the very beginning of the computation P1 and P2 read the initial
sets on the external channels "r1" and "r2". Then they exchange
elements on the internal channels "c" and "d".
Let t1 and t2 be the syntactic trees of P1 and P2 respectively.
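
The intended behavior of P can be checked on concrete data with a small sequential simulation. This is only a sketch under stated assumptions: the function name `partition_exchange` is hypothetical, the two matching communications of each loop turn are collapsed into plain assignments, the comparison operators are taken as strict for P1 and non-strict for P2, and the input sets are assumed non-empty and disjoint.

```python
def partition_exchange(s1, s2):
    """Simulate P1 || P2 on concrete sets; returns the final (I1, I2)."""
    i1, i2 = set(s1), set(s2)          # r1 ? I1 and r2 ? I2
    guard = test = True
    while guard and test:
        y = min(i1)                    # c ! min(I1)  -->  c ? y
        x = max(i2)                    # d ! max(I2)  -->  d ? x
        # Alternative command of P1
        if x < min(i1):
            guard = False
        elif x > min(i1):
            i1 = (i1 - {min(i1)}) | {x}    # ins(x; del(I1; min(I1)))
        # Alternative command of P2 (I2 is still the value before the turn)
        if y >= max(i2):
            test = False
        else:
            i2 = (i2 - {max(i2)}) | {y}    # ins(y; del(I2; max(I2)))
    return i1, i2


FINAL_1, FINAL_2 = partition_exchange({1, 5, 9}, {2, 7, 8})
```

On this input the two processes swap elements twice and then both leave their loops, with every element of the first set greater than every element of the second and the union unchanged.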


The symbolic tree shown in fig.7 is the result of A1[t1]. The structure
of A1[t2] is quite similar. The effect of assignment statements is
embedded into the symbolic states associated with some arcs: they are
the only relevant states in the tree (we choose to use a notation for
symbolic expressions similar to the syntactic one for simplicity
reasons: an effective implementation of our tool must deal with some
concrete representation for symbolic constants, as in /13/).


Let now be

    T1 = A1[ t1 ]
    T2 = A1[ t2 ];

the next step of the analysis is the merging phase, yielding the
symbolic tree for the whole program; in other words we compute

    T = A2[ T1 ; T2 ].

The initial nodes of T1 and T2 are two external communicating nodes: in
this case A2 generates a nondeterministic branch and the resulting
partial tree is pictured in fig.8.
The subtrees T' and T" are very similar because of the symmetry between
T1 and T2. Therefore we will consider only the structure of T'.

During the analysis A2 reaches a point such 
that the intermediate tree has the structure 
shown in fig.9. 

The path labeled (*) corresponds to the situation in which A2 tries to
visit the remaining path of T2, while a terminal node of T1 has already
been reached. Just after the construction of the node (*), A2 visits
the internal communication node of T2

    c ? y

This is obviously a deadlock situation. Hence A2 aborts the path,
yielding the tree pictured in fig.10.

The final tree computed by A2 is shown in 
fig.11. 
It is worth noting that T has one re-entering 
node and that the subtree T" is very similar to 
the other one. The subtrees T3, T4, T5, T6, T7 
and T8 have been omitted because they will be 
suppressed by the third phase. 


First A3 works in forcing mode trying to 
simplify the tree constructed by A2. When it 
reaches the situation pictured in fig.12, the 
current symbolic state for the paths labeled (1) 
and (2) is: 

    s = < guard . true ;
          test . true >
It is worth noting that "s" implies the truth of 
the boolean guard 
pl = 'guard' 
and, conversely, the falseness of the boolean 
guard 
p2 = '~guard'. 
Consequently A3 deletes the path corresponding 
to "p2" (which will never be traversed in an 
actual execution) and simplifies the other path, 
obtaining the tree shown in fig.13. 

Let us now explain the behavior of A3 when 
the loop node is detected: the current symbolic 
state is: 


    s1 = < test . true ;
           guard . true ;
           y . min(I1) ;
           x . max(I2) ;
Ay eS 
I2. 4 > 


which is bound to the loop node. At this point 
the remaining subtree is the one shown in 
fig.14. 
A3 works now in non-forcing mode in the attempt of finding invariant
assertions for the loop. The path with boolean guard

    G1 = ( y ≥ max(I2) ∧ x > min(I1) )

is not explored because the current symbolic state is such that

    s1 → ~G1.

Analogous arguments are valid for the guard

    G2 = ( y < max(I2) ∧ x < min(I1) ).

The path starting with the boolean node

    G3 = ( y ≥ max(I2) ∧ x < min(I1) )

is explored but it leads A3 to a terminal node (i.e. it is an exit
path). On the other hand, on visiting the remaining path, A3 binds
another symbolic state, say S4, to the loop node. The contents of S4
are the following:


    S4 = < guard . true;
           test . true;
           y . min(I1);
           x . max(I2);
           I1 . ins(x; del(I1; min(I1)));
           I2 . ins(y; del(I2; max(I2))) >


At this point of the computation A3 gets the invariant state

    S5 = s1 ∩ S4 =
         < guard . true;
           test . true;
           y . min(I1);
           x . max(I2) >


and restarts visiting the loop with the symbolic 
state S5 in forcing mode. Finally, note that the 
information provided by S5 leads to delete some 
paths in the symbolic tree: in particular those 
starting with the boolean guards Gl, G2, Gd 
which are always false. 

The final tree resulting from the analysis 
is pictured in fig.15. The symbolic states are 
the following: 


    S1 = < guard . true;
           test . true >

    S2 = < guard . true;
           test . true;
           x . max(I2);
           y . min(I1) >

    S3 = < guard . false;
           test . false;
           y . min(I1);
           x . max(I2) >

    S4 = < guard . true;
           test . true;
           I1 . ins(x'; del(I1; min(I1)));
           I2 . ins(y'; del(I2; max(I2)));
           y . min(I1);
           x . max(I2) >


Note that the symbolic expressions in state S3 must represent in some
way the relationship

    min(I1) = max(I2).

Furthermore, by x' and y' we mean the symbolic values held by variables
x and y before the symbolic execution of the matching communications

    c ! min(I1) --> c ? y

    d ! max(I2) --> d ? x.

The analyzer has solved the internal communications and the final tree
contains I/O nodes naming external channels only. This fact has allowed
symbolic information to move from one process to the other and then led
to the resolution of some boolean guards during the forcing mode
analysis of A3.


5. CONCLUSIONS 


This paper reports on a two-people paper project. The project is a
small and "advanced" part of a larger one, whose aim is to build an
integrated software environment for developing ADA programs in a
distributed computing environment /9/. The data structures and
algorithms for an implementation of the analyzer are described in full
in /10/ along with other examples. A simplified CSP-like language has
been chosen instead of ADA for simplicity reasons. Indeed the principal
goal of the authors was to demonstrate that high level analysis based
on the paradigm of symbolic evaluation is of some use also in a
parallel programming framework. Planned developments of the project
are:

- An experimental implementation in Lisp. The reasons for choosing Lisp
  are its orientation to tree manipulation and, more importantly, the
  availability of a simplifier already written for a symbolic
  interpreter.

- Refinements of the design including improving the analysis of loops
  following ideas published in /1/, designing a friendly interface to
  make the user able to display and inspect the symbolic trees produced
  by the analyzer and adding to the system other composition operators
  besides the symmetric composition. It seems interesting, for example,
  to merge a user process and a server process from the viewpoint of
  the user, in order to gain information about the behavior of the user
  process when it gets the service without having to consider the other
  parts of the server.


Abstract
Streams are data structures proposed for inclusion in several research
programming languages, including VAL, to promote parallel execution and
to implement input-output in applicative systems. To avoid paying a
large overhead cost in near-term multiprocessor systems, we propose a
special version of streams whose implementation efficiency potential
does not impair their usefulness in typical applications. Special
streams require no dynamic storage management during element production
and consumption. They are part of a VAL implementation effort for the
Denelcor HEP multiprocessor system.


Introduction 


A stream is a data structure containing an ordered sequence of values.
A stream differs from a vector because elements are accessible only in
the given order. It differs from a list in that some leading elements
may be missing (having been consumed) and some trailing elements can be
absent (not having yet been produced). In its general definition a
stream also differs from a queue because each consumer of a stream
obtains a complete stream of all the produced values. The distinguished
value "end of stream" appears after the last ordinary stream element.
Streams are important for introducing general pipelined computations as
well as input-output capabilities in parallel applicative or data flow
languages.


Our interest in streams results from an 
ongoing project to implement a version of the VAL 
data flow language [1, 6] on the Denelcor HEP 
multiprocessor system [9]. The addition of 
streams to VAL provides for general pipelined 
computation yielding the potential for greatly 
increased parallelism and more effective perfor- 
mance on the HEP architecture, as well as forming 
the basis for defining input-output facilities. 


In this report we will describe the archi- 
tecture that forms the hardware base for the pro- 
ject, outline general streams as defined in the 
VAL proposals, and define special streams that 
show promise for efficient implementation and 
high speed performance in near-term multiproces- 
sor systems. 


This research is supported by ARO Contract DAAG29-82-K-0108.


The HEP Multiprocessor System


Denelcor, Inc. produces the Heterogeneous Element Processor (HEP), a
shared-resource MIMD multiprocessor system. A HEP computer consists of
one or more process execution modules (PEMs) connected by a
packet-switched network to data memory modules and a high speed
input-output cache. Data memory contains up to 1024K of 64-bit words.
Each data word has, in addition to a value, an empty-full state that
may serve to provide producer-consumer synchronization among processes
sharing data memory. A PEM includes program memory (executable code),
constant memory (read-only values), and register memory (general
purpose fast storage, also with the empty-full property). All memories
are allocatable via base and bounds registers to "tasks," groups of
cooperating processes. A process is embodied as a process state word
(PSW) in a queue containing up to 128 entries as part of PEM hardware.
Each 100 nanoseconds a PEM examines the PSW for a process in an
"active" task, examines its next instruction and tests the state of the
source and destination registers. If source registers are full and the
destination is empty, then one of several eight-stage pipelined
function units initiates the operation and the PEM increments the
instruction address in the PSW. If not, the PSW rejoins the tail of the
queue for retry later. Note that in the presence of an ample quantity
of work, this results in a "not very busy wait" solution to the process
synchronization problem. A load or store instruction initiates
interaction with data memory via the switch network. Because of the
length of pipelined function units, a PEM must be executing
instructions for at least eight processes to take advantage of the
potential 10 MIPS instruction execution rate. Although PEM hardware
currently supports 56 user processes, parallel processing speedup does
not increase linearly up to this value because the number of
independently executable function units in a PEM is much smaller. The
remainder of the PSW queue slots hold PSWs for supervisor processes
managing each user process, or are reserved for worst-case situations
when hardware process creations are still in progress.
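
The producer-consumer discipline that an empty-full state makes possible can be imitated in software. The sketch below is an analogy only, not a description of HEP hardware: the class `FullEmptyCell` and the function `run_demo` are hypothetical, and blocking on a condition variable replaces the hardware's re-queueing of the PSW for a later retry. A write waits until the cell is empty and leaves it full; a read waits until the cell is full and leaves it empty.

```python
import threading


class FullEmptyCell:
    """A one-word cell with an empty-full state."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write(self, value):
        """Wait until the cell is empty, then store the value and mark it full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value, self._full = value, True
            self._cond.notify_all()

    def read(self):
        """Wait until the cell is full, then take the value and mark it empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            value, self._full = self._value, False
            self._cond.notify_all()
            return value


def run_demo(n=5):
    """One producer and one consumer sharing a single cell."""
    cell = FullEmptyCell()
    received = []
    producer = threading.Thread(
        target=lambda: [cell.write(i) for i in range(n)])
    consumer = threading.Thread(
        target=lambda: received.extend(cell.read() for _ in range(n)))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return received
```

Because each value must be consumed before the next one can be written, the consumer sees the values in production order without any further locking.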


In data flow architectures, an operator unit executes to produce a
result as soon as all operands are present, and the result is broadcast
to all other operator units that use this value. Data dependencies are
easy to find in VAL programs (lack of side effects, single assignment)
and a translator can produce data flow graphs effectively. A translator
could produce HEP processes that perform atomic arithmetic or logical
operations sharing data or register memory cells for input and output
but the overhead would preclude satisfactory parallel execution.
Instead, we intend to implement two forms of concurrency at a higher
level. First, a VAL function invocation will initiate a HEP process
that executes in parallel with the invoking function and with functions
that it calls in turn. Second, some parallel loops will have their
bodies executed simultaneously by several processes. This level of
process granularity will, for sizable programs, use the entire
capability for parallel operation in a single PEM machine. It will be
easy to adjust this granularity if appropriate. Since a HEP PEM
supports 56 processes maximally, some limitations and refinements to
this intent are required. (Solving the process management problem is
the subject of another report.) In the next section we will describe
how general streams are defined, produced, and consumed.


General Streams


VAL functions in which parallel or sequential loop control structures
yield individual stream elements produce values of type stream. The
same control structures use up stream values element by element. In
addition, they consume input file values and produce output file
streams.

The operations associated with general streams in VAL are either
implicit in the language semantics or explicitly represented in
operators of the language. Implicit operations are the creation and
deletion of streams, appending the end-of-stream value, and the
production of copies of entire streams for new consumers. We will
discuss the last operation later. The invocation of a producer function
(process) instantiates a stream. VAL does not admit the possibility of
more than one producer for a particular stream. Stream deletion results
when all consumers have terminated and future existence of additional
consumers is not possible. Producer termination causes the implicit
appending of the end-of-stream value.

Explicit stream operations include copying the front stream element,
"first," replicating the stream except for the first element, "rest,"
testing for end of stream, "empty," and adding an element to the end of
a stream, "returns stream of."

Some simple examples of stream usage follow. The syntax is that for a
revision of VAL described in [8].


(1) A function that produces the odd positive integers up to a given
    limit in a stream (uses the sequential "for" loop)--

    type IS = stream[ integer ];

    function INTS( LMT: integer returns IS )
      for I := 1
      while I <= LMT
      repeat
        I := old I + 2
      returns stream of I
      end for
    end function

(2) A function that produces the negation of the contents of an integer
    stream--

    function NEGATE( S: IS returns IS )
      for I in S
        returns stream of -I
      end for
    end function

(3) A function that accepts a stream of integers and emits a stream
    whose elements come from the input stream, excepting those that are
    multiples of another integer parameter (includes the phrase "unless
    boolean expression" to exemplify conditional stream element
    production)--

    function FILTER( S: IS; P: integer
                     returns IS )
      for I in S do
        returns stream of I
          unless mod(I,P) = 0
      end for
    end function

(4) A function to merge two ordered streams into a single output stream
    (uses "first" and "rest" to consume streams)--

    function MERGE( SA, SB: IS returns IS )
      for TA := SA; TB := SB
      while not ( empty( TA ) and empty( TB ) )
      repeat
        TA, TB :=
          if empty( old TA ) then
            old TA, rest( old TB )
          elseif empty( old TB ) then
            rest( old TA ), old TB
          elseif first( old TA ) <=
                 first( old TB ) then
            rest( old TA ), old TB
          else old TA, rest( old TB )
          end if
      returns stream of
          if empty( TA ) then first( TB )
          elseif empty( TB ) then first( TA )
          elseif first( TA ) <=
                 first( TB ) then first( TA )
          else first( TB )
          end if
      end for
    end function


(5) A function that accepts an integer stream parameter and produces an
    output stream whose N-th element is the sum of the first N values
    in the input stream--

    function COMPOUND( S: IS returns IS )
      for T := S; V := 0
      while not empty( T )
      repeat
        V := old V + first( old T );
        T := rest( old T )
      returns stream of V
      end for
    end function


(6) A function that produces both the sum and the product of the
    elements of its integer stream argument (each "for" loop consumes
    the entire stream)--

    function SUMPROD( S: IS
                      returns integer, integer )
      for I in S
        returns value of plus I
      end for,
      for I in S
        returns value of times I
      end for
    end function

(7) A function that accepts a stream of integers and an integer value
    K, and emits the original stream with each element multiplied by
    the K-th element of the input stream (the unwritten auxiliary
    function KTH consumes through the K-th value, then the "for" loop
    consumes the entire stream to produce the output stream)--

    function MAGNIFY( S: IS; K: integer
                      returns IS )
      let V := KTH( S, K )
      in
        for I in S
          returns stream of V * I
        end for
      end let
    end function


Observe that in each of these examples the function operates
asynchronously as a "pump" to take in stream values or to push out
computed values on an output stream. This is reminiscent of "systolic
architectures" [5] and has the same potential for large-scale parallel
execution. However, on the HEP system, software processes (instead of
fixed hardware components with static interconnections) will implement
stream producers and consumers.
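
The "pump" behavior of these functions has a close analogue in Python generators, which also produce and consume one element at a time. The sketch below mirrors examples (1), (3), (4) and (5); it is an illustration only, not VAL semantics, and it runs sequentially rather than as parallel processes. The function names and the `_END` sentinel standing for the end-of-stream value are assumptions of the sketch.

```python
_END = object()  # stands for the distinguished end-of-stream value


def ints(limit):
    """Odd positive integers up to limit, as in INTS."""
    i = 1
    while i <= limit:
        yield i
        i += 2


def stream_filter(s, p):
    """Elements of s that are not multiples of p, as in FILTER."""
    for i in s:
        if i % p != 0:
            yield i


def merge(sa, sb):
    """Merge two ordered streams into one ordered stream, as in MERGE."""
    ia, ib = iter(sa), iter(sb)
    a, b = next(ia, _END), next(ib, _END)
    while a is not _END or b is not _END:
        if b is _END or (a is not _END and a <= b):
            yield a
            a = next(ia, _END)
        else:
            yield b
            b = next(ib, _END)


def compound(s):
    """Running sums: the N-th output is the sum of the first N inputs."""
    total = 0
    for v in s:
        total += v
        yield total
```

For instance, `list(merge(ints(7), [2, 4, 6]))` walks both inputs once and yields the integers 1 through 7 in order.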


The last two examples show the effect of the 
implicit "copy an entire stream" operation. In 
spite of the existence of a consumer using up 
stream elements, another consumer must be able to 
access all elements generated by the producer. 
This means that the decision about when a stream 
element is discardable, and when an entire stream 
can be deleted, is a difficult run-time issue. 
In near-term von Neumann architectures, it seems
that general stream implementations will require
dynamic storage management for linked lists of
stream values and the capacity to store arbitrary
numbers of unconsumed stream elements. The cost
of such management stands in opposition to our
goal of promoting high speed parallel execution.
In the next section we restrict our definition
to "special streams" whose implementation
may allow high speed processing on the kinds of
architectures available now and in the near
future.


Special Streams 


These definitions of operations on streams
are meant to promote a more efficient implementation
than is possible for general streams in
near-term von Neumann multiprocessor systems. In
particular we propose changes in the semantics of
general streams to avoid dynamic storage management
during the lifetime of a stream. Programs
use the explicit operations "first," "rest,"
"empty," and "returns stream of" as for general
streams. The implicit (automatic in language
semantics) operations for stream creation,
appending the end-of-stream value, and deletion
are unchanged. Implicitly copying an entire
stream is no longer possible: the use of the
same stream name by multiple consumer loops in a
function results in each consumer loop obtaining
a disjoint subsequence of the stream of values.
(There can be programs in which this is a desired
effect.) If more than one consumer must share the
entire stream, then an explicit preceding assignment
to another stream value is necessary. We
define passing a stream as a parameter to another
function as another form of the explicit stream
copy operation.
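The difference between sharing one stream name and making an explicit copy has a loose analogue in Python iterators, sketched below. The function names are hypothetical and the analogy is an assumption of this sketch: `itertools.tee` buffers unconsumed items dynamically, so it resembles the storage cost of general streams rather than the bounded implementation proposed for special streams.

```python
from itertools import tee


def drain_twice(stream):
    """Two loops over one iterator: the second sees only what the first left."""
    first_pass = [x for x in stream]
    second_pass = [x for x in stream]   # disjoint from first_pass (here: empty)
    return first_pass, second_pass


def sumprod(stream):
    """Sum and product of a stream, using an explicit copy for the second loop."""
    s, t = tee(stream)                  # explicit copy, like "let T := S"
    total = 0
    for i in s:
        total += i
    product = 1
    for i in t:
        product *= i
    return total, product


print(drain_twice(iter([1, 2, 3])))
print(sumprod(iter([1, 2, 3, 4])))
```

In `drain_twice` the two loops obtain disjoint subsequences of the values, which is the special-stream behavior described above; in `sumprod` each loop sees the whole stream only because the copy is requested explicitly.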


The stream copying via assignment or parameter
passing may be actual: a new stream structure
receives a copy of the current stream contents.
Then the stream producer emits a new value to the
tail of each copy of the stream it produces (or
blocks if at least one of the stream data structures
is full), and appends the end-of-stream
value to each when it terminates. Each consumer
works via "first," "rest," and "empty" operations
on its own private data structure. When a consumer
terminates, deletion of the associated stream
copy occurs. On the other hand, the stream copying
can be realized by associating a reference
count with each stream buffer (holding a single
stream element). The producer adds a reference
count (initialized to the current number of consumers
of the stream) to each emitted stream
value. Each consumer works on the same data
structure, and so must maintain its own pointer
to the first value. The "rest" operation decrements
the reference count of the first value, and
moves the first pointer. The producer may fill a
buffer when its reference count is zero, and must
block if there is no empty (zero reference count)
buffer. A stream copying operation via assignment
or parameter passing increments the current
number of consumers and the reference counts of
each nonempty stream buffer. Consumer termination
decrements the current number of consumers
and the reference counts on unconsumed elements.
A stream is deletable when the current number of
consumers reduces to zero. No advantage accrues
to the first approach, and since the second
requires less copying of information, we will use
it.
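The reference-count scheme just described can be sketched as a small single-threaded model. Everything named here (`SharedStream`, the circular buffer layout, a fixed consumer count, and `emit` returning False where a real producer would block) is an assumption made for illustration; the paper does not give the HEP implementation, and stream copying and consumer termination are omitted.

```python
class SharedStream:
    """Bounded stream shared by several consumers via per-buffer reference counts."""

    def __init__(self, n_buffers, n_consumers):
        self.values = [None] * n_buffers
        self.refs = [0] * n_buffers       # zero means the buffer is empty
        self.n_consumers = n_consumers
        self.head = 0                     # number of values emitted so far
        self.pos = [0] * n_consumers      # per-consumer "first" pointer

    def emit(self, value):
        """Producer: fill the next buffer, or return False where it would block."""
        slot = self.head % len(self.values)
        if self.refs[slot] != 0:          # some consumer still needs this buffer
            return False
        self.values[slot] = value
        self.refs[slot] = self.n_consumers
        self.head += 1
        return True

    def first(self, consumer):
        """Return the consumer's current first value without consuming it."""
        if self.pos[consumer] >= self.head:
            raise IndexError("no value available yet")
        return self.values[self.pos[consumer] % len(self.values)]

    def rest(self, consumer):
        """Decrement the first value's reference count and advance the pointer."""
        if self.pos[consumer] >= self.head:
            raise IndexError("no value available yet")
        self.refs[self.pos[consumer] % len(self.values)] -= 1
        self.pos[consumer] += 1


s = SharedStream(n_buffers=2, n_consumers=2)
s.emit(1)
s.emit(2)
print(s.emit(3))      # both buffers still referenced: producer would block
s.rest(0)
s.rest(1)
print(s.emit(3))      # the oldest buffer is now free
```

The model shows why the producer can never run more than `n_buffers` elements ahead of the slowest consumer, the property that later causes the difficulty with MAGNIFY.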


These special streams of course behave differently
than general streams, and some programs
using streams differ from those using previously
defined versions. We believe that special
streams are as useful as general streams in most
applications. The last two program examples
above using the implicit stream copying operation
cannot work in that form. The former can be
rewritten with an auxiliary assignment:
function SUMPROD( S: IS
returns integer, integer )
let T := S
in
for I in S
returns value of plus I
end for,
for I in T
returns value of times I
end for
end let
end function
The latter presents a difficulty for special
streams. No matter how we implement special
streams to avoid dynamic storage management
(copying for each consumer or reference counts on
values), the "first" pointers for all the consumers
must be within a contiguous substream. The
size of this substream cannot exceed the number
of buffers in the stream data structure because
the producer cannot get more than that number of
elements ahead of the slowest consumer. No performance
problems should arise in the expected
kinds of "systolic" stream applications when
the executing program has ample work to keep the
components of a multiprocessor system active. In
MAGNIFY, if K is greater than the number of
buffers available, the subsequent "for" loop will
not be able to consume some prefix stream values,
and the function value will differ from that for
general streams. Problems such as this may arise
in real programs, but such occurrences may suggest
that randomly accessible structures (arrays)
are more appropriate.


Summary


Streams have been recognized by several 
researchers to be useful for promoting parallel 
computation. There is, however, no experience in 
production systems with general streams. Our 
purpose is to implement a version of the parallel
language system VAL including streams on a
currently available multiprocessor system. Special
streams as described here seem to be a valuable
compromise preserving the potential for
parallel execution while enhancing the possibility
of efficient implementation.


A DATABASE MACHINE
FOR VERY LARGE RELATIONAL DATABASES 


G. Z. Qadah and K. B. Irani
Ann Arbor, MI 48109


ABSTRACT -- In this paper, we present an architectural 
design for a Back-End Database Machine (DBM) suitable 
for supporting concurrent, on-line, very large relational 
database systems. This machine is called Michigan Rela- 
tional Database Machine (MIRDM). In designing such a 
machine, a structured approach has been followed. 
First, the DBMs proposed so far have been reviewed
using a novel classification scheme. Next, this review,
the very large relational database system requirements
and the restrictions imposed by the current and near
future state of technology have been used to formulate a
set of design guidelines. Consequently, an architecture
for a cost-effective DBM that meets the latter set of
guidelines has been synthesized.


1. Introduction 


The collection of data in the form of an integrated
database is a sound approach to data management. The
conventional implementation of the database system,
that is, the augmentation of a large general purpose
conventional von Neumann computer with a large complex
software system, the Database Management System
(DBMS), suffers from several disadvantages. These
disadvantages are:


(1) Low reliability due to the large complex software
system,


(2) Poor performance due to the fact that the underlying
hardware is a general purpose von Neumann
processor with insufficient processing power and little
parallelism, and

(3) Inability to meet the demands for increased pro- 
cessing power and throughput to fulfill the current 
and anticipated large increases in database size 
and usage. 


The limitations of the conventional database sys- 
tems, the continuous advancements in memory- 
processor technology, and the continuous reduction in 
its fabrication cost have inspired a new approach to the 
database system implementation. This approach 
replaces the general purpose von Neumann processor 
with a dedicated machine, the Database Machine (DBM), 
tailored to the data processing environment. Mostly it
utilizes parallel processing to support some or all the
functions of the DBS. This approach improves the
system's reliability through software complexity reduc- 
tion and improves the system's performance through 
specialization, increased parallelism and increased pro- 
cessing power. 


Most of the DBMs proposed so far have been organized
as "back-end" machines to one or more general
purpose computer(s), called the Host(s). While the host
is responsible for interfacing the users to the DBM, the
DBM itself is responsible for the database access and
control. The "back-end" design concept for the DBM was
first introduced in [1].


The general objective of this work is the design of a
back-end DBM suitable for supporting the concurrent 
on-line, very large relational database systems. This 
machine will be called the Michigan Relational Database 
Machine (MIRDM). The design of MIRDM is done in two 
steps. In the first step, the DBMs proposed so far are 
reviewed. This review is based on a novel scheme for 
classifying these machines. The new scheme helps us 
understand the previous organizations of the DBMs as 
well as their design trade-offs. It also provides us with a
systematic way to qualitatively analyze and compare 
such organizations. 


In the second step, the above analysis coupled with 
the requirements of the very large relational database 
systems as well as the current and near future state of 
the hardware technology is used to arrive at a set of 
guidelines along which our DBM must be designed. Con- 
sequently, an architecture for a cost-effective DBM that 
meets the latter set of guidelines is synthesized. 


This paper is divided into five sections. Section 2
qualitatively reviews, analyzes and compares the various
previously proposed designs for the relational DBMs.
Section 3 formulates a set of guidelines based on the
qualitative analysis of Section 2, the current and near
future state of technology and the very large relational
database systems requirements. It also outlines a new
DBM, called MIRDM, synthesized along the latter guidelines,
and it presents a brief introduction to the algorithms
that implement the primitives of this machine.
Section 4 provides qualitative comparisons between
MIRDM and some other DBMs already proposed. Finally,
Section 5 gives some concluding remarks.


2. Review of the Previously Proposed DBMs 


During the past decade, a large number of DBMs 
have been proposed. Some of them have also been 
implemented. Others have been commercialized. All of 
these machines have been designed to partially or
totally support the relational database or to support the
relational database together with the other database
types, namely, the network and the hierarchical. In the
following, a novel scheme for classifying the set of the
DBMs proposed so far will be presented. This scheme
will next be used to qualitatively evaluate and compare
the respective DBMs.


2.1. A Classification Scheme for the Previously Proposed DBMs


The new scheme views the DBMs as points in a three 
dimensional space, the DBM space. The coordinates of 
this space are the indexing level, the query processing 
place and the processor-memory organization.


The most fundamental and important operations
the DBMs were designed to support are the selection
(from a permanent relation) and the modification operations.
In the early designs of the DBMs these operations


qualification expressions), however, they perform poorly
in executing more complex database operations that
require many disk revolutions (the θ-Join and the projection
operations, for example). This was evident in the
performance evaluation of RAP [5].
Recall that a query can be thought of as a tree


level does not improve the DBM cost-effectiveness. It is 
very clear that such DBMs suffer from the same prob- 
lems as the DBMs-DB type. 


The index tables supported by the DBMs which
belong to the DBMs-Page category are defined for the
set of the most frequently referenced attributes of the


whose nodes represent a set of database operations [3]. 


The leaves of the tree reference only permanent rela- 
tions of the database. In a real database environment, 
the leaf nodes are mostly of the selection and update 
types. A hybrid DBM processes the leaf selection opera- 
tions and, in some machines, the update operations on 
the disk. The resulting relations (referred to as the
temporary relations) are then moved to a fast
processor-memory complex where the rest of the query
operations (if any) are executed. In most cases, execution
of the leaf selection and update operations on the
disk largely reduces the volume of data needed to be
moved to the fast processor-memory complex.


The above discussion indicates that the perfor- 
mance of the DBMs of the Hybrid-DB design approach is 
superior to that of both the On-Disk-DB and the Off- 
Disk-DB DBMs. However, the Hybrid-DB design approach 
compounds the cost problem because it combines two 
very expensive technologies, namely, the logic-per-track 
disks and the associative memories. 


In general, the number of tuples which are 
selected/modified as a result of executing a typical 
selection/modification operation is a small fraction of 
the number of tuples of the database. In the light of the 
current processor-memory technology and the vast 
amount of data in the very large databases, it is clear 
that the need to scan the whole database, at least once, 
for carrying out these operations is very cost-ineffective.
The design approach which provides the
DBM with a mechanism that eliminates this need is
highly desirable.


In the context of the DBMs, an index table defined
for the permanent database has been used as a mechanism
to reduce the amount of data to be processed for a
given selection or modification operation. For every
relation of the database, the index table defined for a
DBM of the DBMs-Relation category stores the set of
addresses of the minimum access units (MACUs) which
store the corresponding relation. Although the size of
the index table (relative to that of the database) is a
function of the number of relations in the database and
the size of the MACU, it is nevertheless very small. Its
maintenance, storage cost and access time are negligible
relative to those of the database. To execute a
selection/modification operation on an On-Disk-Relation/Hybrid-Relation
DBM, only the data units which
store the relation referenced by the operation are
searched/modified. For those machines which use a
logic-per-track disk as a storage medium, only the tracks
which correspond to these units are processed. On the
other hand, to execute these operations on an Off-Disk-Relation
DBM, only the data units which store the relation
referenced by the operation need be moved to the
processor-memory complex for processing.


The above argument suggests that the DBMs which 
support only a relation level index table perform dif- 
ferently for different types of databases. In the case of 
databases dominated by relations of small size, indexing 
on the relation level substantially improves the DBM 
cost-effectiveness. For the on-disk/hybrid organized 
DBMs, the logic-per-track disk can be replaced by a less 
expensive mass storage media. For the Off-Disk DBMs, a 
less expensive I/O channel can be utilized. On the other 
hand, in the case of the databases dominated by relations
of relatively large size, indexing on the relation

database. For every value of each of these attributes,
the index table stores the set of addresses of all the
pages which contain tuples having that value. Although 
the size of the index table (relative to that of the data- 
base) is a function of the size of the page itself, 
nevertheless it, as well as its storage maintenance cost, 
is substantial. A clustering mechanism has to be used
[6] in conjunction with the page index table. This
mechanism is used to store the set of tuples which are 
frequently referenced together in as few pages as possi- 
ble. In general, the use of the page indexing coupled 
with an efficient clustering mechanism improves the 
performance of the selection and possibly the modifica- 
tion operations. On the other hand, the index search 
introduces some overhead for executing these opera- 
tions as well as the insertion and deletion operations. 


In the case of the very large database systems, the
use of small size pages results in a relatively large page 
index table which requires a huge amount of storage, 
large access time and high maintenance and update 
cost. The improvement in the execution time for the 
selection and modification operations does not justify 
this overhead and the additional storage cost. On the 
other hand, the use of large size pages, in conjunction 
with a DBM, results in a relatively small page index (~1% 
of the total database size [6]). Thus its storage cost, 
access time and maintenance cost are substantially 
lower. Moreover, the performance of the selection and
the modification operations is enhanced, especially if a page is
processed by a number of processors in parallel. The use of
the page index, large size pages and an efficient clustering
mechanism has allowed designers to replace the
expensive logic-per-track disk with the relatively cheap
moving-head-disk (or a slightly modified version of it) as
a unit for the mass storage.


In the scheme, presented earlier, the previously 
proposed DBMs have been organized as SISD, SIMD and 
MIMD machines. In general, the execution of the database
operations on a DBM of the first/second group is
done serially. That is, one operation (possibly two for
the Hybrid DBMs) is the maximum number of operations 
that a DBM of these types can execute at any given time. 
While a database operation is executed by one processor 
on the SISD DBMs, it is executed by more than one pro- 
cessor on the SIMD DBMs. The MIMD DBMs, on the other 
hand, execute one or more database operations simul- 
taneously. The operation itself also gets executed by 
more than one processor. 


Because of the relatively low cost of the processor
and memory devices, the use of parallel processing for
very large databases enhances the effectiveness of the
DBMs. Although the SIMD organization of the DBMs
reduces the execution time of a database operation, it
does not offer a real solution to the concurrent user
problem. The MIMD organization is more effective in
supporting databases where fast concurrent access to
the database is a basic requirement. This is due to the
fact that the MIMD organization has the ability not only
to execute a database operation in parallel but also to
execute more than one operation (from the same or different
queries) simultaneously and in parallel.


One important drawback in the MIMD organization 
is the associated overhead, namely, the large amount of 
processing needed for controlling the simultaneous exe- 
cution of different operations and the management of 


were carried out using an entirely associative approach. 
That is, the whole database (a set of permanent rela- 
tions) was scanned and the data items which satisfied 
the selection/modification criteria were 
retrieved/modified. Soon, researchers in the DBMs field 
came to realize that such an approach is not a cost- 
effective one since it requires that the whole database 
be scanned, at least once, for every selection or modifi- 
cation operation regardless of the size of the operation's 
response set. 


To achieve a more cost-effective DBM design, a
new approach for performing the selection and the
modification operations has been followed. It is called
the quasi-associative approach. In this approach only a
relatively small portion of the database needs to be processed
for every operation (rather than all). To support
such an approach, the database is divided into a set of
data units. In order to perform the selection operation,
for example, the machine first maps the corresponding
selection qualification expression to the set of data units
which contain the data which satisfy the qualification
expression. Then each data unit of the latter set is
searched and the data items which satisfy the selection
qualification expression are extracted.
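The contrast between the entirely associative and the quasi-associative approaches can be illustrated with a small sketch. The data layout (tuples as dictionaries, data units keyed by an address, an index keyed by attribute and value) and all function names are assumptions of this sketch and are not taken from any of the machines reviewed; only equality qualifications are handled.

```python
def select_entirely_associative(units, predicate):
    """Scan every data unit of the database for qualifying tuples."""
    return [t for unit in units.values() for t in unit if predicate(t)]


def build_index(units, attribute):
    """Map (attribute, value) to the addresses of the data units holding it."""
    index = {}
    for address, unit in units.items():
        for t in unit:
            index.setdefault((attribute, t[attribute]), set()).add(address)
    return index


def select_quasi_associative(units, index, attribute, value):
    """Map the qualification to candidate data units, then search only those."""
    hits = []
    for address in sorted(index.get((attribute, value), ())):
        hits.extend(t for t in units[address] if t[attribute] == value)
    return hits


units = {
    0: [{"dept": "A", "id": 1}, {"dept": "B", "id": 2}],
    1: [{"dept": "B", "id": 3}],
    2: [{"dept": "C", "id": 4}],
}
index = build_index(units, "dept")
print(select_quasi_associative(units, index, "dept", "B"))
```

Both functions return the same tuples for an equality qualification, but the quasi-associative version touches only units 0 and 1 in this example, at the price of building and maintaining the index.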


In most of the DBMs proposed so far, the structure
used to map the qualification expression of the selection
or the modification operation to the corresponding data
units of the permanent database is the index table [2].
These tables are defined for the database and need to be
stored and maintained. The data unit, the smallest
addressable unit of data, can be logical (i.e., the database,
no indexing, or a relation) or physical (i.e., a set of
tracks, a track or part of a track of a moving-/fixed-head-disk).
The physical data unit is called a page.


The first coordinate in the proposed scheme is the 
indexing level defined for the permanent database and 
supported by the particular DBM. Along this coordinate, 
the DBMs can be grouped into three categories, namely, 
the DBMs with database indexing level (DBMs-DB), DBMs 
with relation indexing level (DBMs-Relation) and DBMs 
with page indexing level (DBMs-Page). The first category 
includes all the DBMs which support only the entirely 
associative approach. The DBMs of the second category 
support the quasi-associative approach. The index tables 
of the machines in this category are defined with the 
permanent relations as the minimum addressable 
units.* The DBMs of the third category also support the 
quasi-associative search approach. However, in addition
to supporting the relation level index tables, they also
support index tables defined for pages (containing
tuples from the permanent relations) as the minimum 
addressable units. 


The second coordinate in the proposed scheme is 
the query processing place. Along this coordinate, the
DBMs can be grouped into three categories, namely, the
Off-Disk,** the On-Disk and the Hybrid categories. The
DBMs of the first category process the database query
off the disk where the database is stored. In doing that
the DBMs of this category need to move the data which 
are relevant to the query from the disk to a separate 


*To facilitate the parallel processing as well as the data move- 
ment, some DBMs of the first/second category store the corresponding 
minimum addressable unit of data (DB/relation) on a set of physical 
units, the minimum access units (MACUs). Each could be moved
separately. However when a data item needs to be retrieved, all the 
MACUs containing the DB/relation are processed. In the DBMs of the 
third category, the page (the minimum addressable unit) is contained 
in one MACU. 


**The disk here implies a moving-head-disk, a fixed-head-disk or
an electronic disk such as the magnetic bubble memory (MBM) or the
charge coupled devices memory (CCD). The disk(s) stores the database.

processor-memory complex where the query processing
takes place. 


The DBMs of the second category execute the data- 
base query on the disk. The machines of this category 
need not move data from the disk to a different memory 
for processing. The disk is provided with logic units and 
the query processing is done directly on the disk where 
the database is stored. 


The DBMs of the third category execute part of the
query (the selection operations and, in some machines,
the update operations) on the disk, and move the resulting
data to a separate processor-memory complex
where the rest of the query execution (if any) takes
place.

The third coordinate in the proposed scheme is the 
processor-memory organization. This coordinate 
characterizes the hardware of the DBMs. For the On-Disk/Off-Disk
machines, this coordinate characterizes
the way the processor-disk/processor-memory complex
executes the database operations. For the hybrid
machines, this coordinate characterizes the way both
the processor-disk and the processor-memory execute
the database operations. Along the third coordinate,
the DBMs can be grouped into three categories: 1) the
single instruction stream-single data stream (SISD), 2)
the single instruction stream-multiple data stream
(SIMD), and 3) the multiple instruction stream-multiple
data stream (MIMD) categories.
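The three coordinates of the DBM space can be written down as a small data type, which makes the size of the space (3 x 3 x 3 = 27 classes) explicit. The type and member names below are assumptions of this sketch, and the example point is illustrative only.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class IndexingLevel(Enum):
    DB = "DB"
    RELATION = "Relation"
    PAGE = "Page"


class ProcessingPlace(Enum):
    OFF_DISK = "Off-Disk"
    ON_DISK = "On-Disk"
    HYBRID = "Hybrid"


class Organization(Enum):
    SISD = "SISD"
    SIMD = "SIMD"
    MIMD = "MIMD"


@dataclass(frozen=True)
class DBMClass:
    """One point in the three-dimensional DBM space."""
    place: ProcessingPlace
    level: IndexingLevel
    organization: Organization

    def label(self):
        # Follows the naming pattern used in the text, e.g. "Off-Disk-DB".
        return f"{self.place.value}-{self.level.value} ({self.organization.value})"


all_classes = [DBMClass(p, l, o)
               for p, l, o in product(ProcessingPlace, IndexingLevel, Organization)]
example = DBMClass(ProcessingPlace.OFF_DISK, IndexingLevel.DB, Organization.SISD)
print(len(all_classes), example.label())
```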


2.2. A Critical Look at the Previously Proposed DBMs. 


Figures 2.1 through 2.3 show most of the previously
proposed DBMs grouped according to the classification 
scheme presented earlier. For more information about 
these machines one can refer to [3] and to the refer- 
ences presented in the latter figures. 


Historically speaking, the first DBMs to be proposed
were organized as Off-Disk machines and were provided
with only the associative access to the database (Off-Disk-DB
DBMs). In general, these machines suffer from
many drawbacks, in particular, their ineffectiveness in
handling the very large database systems. A DBM of this
type has to move all of the database from the slow disks,
where it is stored, to a fast (associative) memory where
the execution of the selection or the update operations
takes place. In a large database system environment,
the I/O channels easily become a system bottleneck as
a result of the mass data movement.


The On-Disk-DB design eliminates the data move- 
ment problem. This design approach avoids that prob- 
lem by processing the database operations on the mass 
storage device where the database resides. The most 
important drawback of this approach is the fact that it 
stores the database on a set of logic-per-track disks. 
Using this disk as a mass storage device is very costly, 
in fact orders of magnitude more expensive than the 
moving-head-disk. The high cost of the logic-per-track 
disk is attributed mainly to the high cost of its two basic 
components, namely, the storage component, the fixed 
head-per-track disk and the processing units attached 
to each head of the head-per-track disk. The anticipated
trends in the mass storage technology [4] show
that the logic-per-track disk or its electronic counterpart
will not challenge the speed/cost level of the
moving-head-disk for at least the near future.


Another important problem in the DBMs which 
adopt the On-Disk-DB design approach is the fact that 
they perform relatively well in executing the simple 
database operations which require few disk revolutions 
(the selection and update operations with simple 


the various system components. In most MIMD DBMs, 
such overhead puts an upper limit on the number of 
queries that can be executed and on the resources that
can be simultaneously active in the system. Therefore, 
controlling and minimizing such overhead must be a 
basic objective for the MIMD DBM designer. 


3. The Architecture of the New System


The most important characteristics of the contem- 
porary and anticipated very large scale relational data- 
base systems are the vast amount of data in such sys- 
tems and the large number of users requiring simul- 
taneous access to this data. These basic characteristics 
impose two important requirements on any design for a 
relational DBM, namely: 


1. Availability of a large capacity store,


2. Ability to handle the on-line concurrent access to
the database with adequate response time and
throughput.


Based on the above requirements as well as the
state of the current and anticipated mass storage technology,
and in the light of our study of the previously
proposed DBMs, a set of guidelines along which a DBM
should be designed has been formulated. These guidelines
are:


(1) The mass store is to consist of moving-head-disks.
This disk type has been selected for its ability to provide
a vast amount of on-line storage at a relatively low cost
and moderate performance. Currently the magnetic
fixed head-per-track disk is considered obsolete as a
mass storage device. The electronic disk (the MBM and
the CCD memory devices) technology is at least one
order of magnitude more expensive than that of the
moving-head-disk. A look at the future directions in the
mass storage technology shows that the electronic disk
will not challenge the speed/cost level of the moving-head-disk
for at least the near future [4].


(2) Support the page level indexing. This type of index- 
ing greatly improves the execution time of the selection 
and the modification operations. On the other hand, it 
introduces some overhead in executing these operations,
in the form of index table access delay and
maintenance, and increases the execution time of the 
other update operations. To minimize the drop in per- 
formance due to this overhead, the page must be 
selected to have a large size (multiple tracks of the 
moving-head-disk) and be processed in parallel, by a 
number of processors. Also, the access to the page level 
index table should be supported at the hardware level. 


(3) Organize the DBM as an off-disk type. Although this
organization introduces some increases in the execution
time of the database operations (due to the movement
of data to the processor-memory complex), it nevertheless
avoids providing the moving-head-disk with
high speed, specially designed logic units capable of processing
data "on the fly", thus keeping the mass storage cost
at its minimum. The amount of data to be moved can be
reduced by taking advantage of the local and sequential
references in the database. The processor-memory
complex should be designed to effectively support not
only the relational algebra operations, such as the selection,
the projection, and the θ-Join, but also the primitives
that manipulate the page level index.


(4) Organize the DBM as an MIMD type. This is very
important for providing the machine with the capability
of handling concurrent access to the database. The proposed
design must be able to handle, at the hardware
level, the excessive overhead associated with the MIMD
organization.


A DBM which follows the above guidelines has been
designed. This machine is called Michigan Relational 
Database Machine (MIRDM). However, before presenting 
its organization, the way the data is organized in MIRDM 
will be next outlined and discussed. 


3.1. The Data Organization


In general MIRDM stores two types of data, namely:


1. The Database


The database is organized as a collection of time 
varying relations. The database is divided into a set of 
large data units. Each, called a page (or the minimum 
addressable unit** (MAU)), represents the smallest 
addressable unit of data. The only tuples which are 
allowed in the same MAU are those that belong to the 
same relation. 


2. The Database Directory


The database directory contains the information 
needed to map a "data name" to the set of MAU 
addresses which store the named data. In the proposed 
machine, data is named at two levels, namely, the relation
level (relation-name) and the tuple level (tuple-name:
<relation name, attribute name, value>). The
tuple-name may not be unique. 




The database directory consists of two indices, 
namely the relation index and the MAU index. The relation
index maps the "relation-name" to a set of MAU
addresses. These MAUs contain all the tuples of the 
relation whose name is "relation-name". The relation 
index contains entries for all the relations in the data- 
base. 


The MAU index maps a "tuple-name" to a set of MAU
addresses. Each of these MAUs contains at least one 
tuple which has the "tuple-name" as its name. The MAU 
index is organized as a three level index. The first level 
is the MAU master index, the second level is the attri- 
bute index and the third level is the index-term index. 
The index-term index is a collection of index terms, 
each of which is an ordered quadruple of the form: 


< relation name, attribute name, value, MAUA > 


where MAUA is an MAU address which contains at least 
one tuple with the name "< relation name, attribute 
name, value >". 


In general, the index terms are only defined for 
those relations and attributes which are frequently 
referenced by users. The index terms in the proposed 
system are grouped and stored in units equal in size to 
an MAU, called the index MAU (IMAU). Although the 
IMAU can contain index terms defined for different attributes
of different relations, a clustering mechanism is
used to cluster, into the same IMAU, those index terms 
which are defined for the attributes of the same rela- 
tion. This improves the storage cost and the processing 
efficiency of the index terms. 


The MAU master index and the attribute index have 
been introduced in order to reduce the number of IMAUs 
which need to be processed for a selection/modification 
operation. The MAU master index maps a "relation-name"
to its attributes. The attribute index maps an
attribute name to a set of IMAU addresses. The IMAUs
contain the set of index terms which are defined for the 
corresponding attribute. 
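

The three-level lookup described above can be illustrated with a minimal Python sketch. It is not the MIRDM implementation: the class name, method names and in-memory dictionaries are assumptions made for illustration, and the hardware parallelism is ignored.

```python
from collections import defaultdict


class MauIndex:
    """Toy model of the three-level MAU index (all names are hypothetical)."""

    def __init__(self):
        # Level 1: MAU master index, relation name -> indexed attribute names.
        self.master = defaultdict(set)
        # Level 2: attribute index, (relation, attribute) -> IMAU addresses.
        self.attribute = defaultdict(set)
        # Level 3: index-term index, IMAU address -> index terms, each an
        # ordered quadruple (relation, attribute, value, MAU address).
        self.imaus = defaultdict(list)

    def add_term(self, imau, relation, attribute, value, maua):
        """Record one index term in the given IMAU."""
        self.master[relation].add(attribute)
        self.attribute[(relation, attribute)].add(imau)
        self.imaus[imau].append((relation, attribute, value, maua))

    def lookup(self, relation, attribute, value):
        """Return the MAU addresses holding at least one matching tuple."""
        if attribute not in self.master.get(relation, ()):
            return set()  # no index terms are defined for this attribute
        hits = set()
        # Only the IMAUs listed for this attribute are scanned.
        for imau in self.attribute[(relation, attribute)]:
            for rel, attr, val, maua in self.imaus[imau]:
                if (rel, attr, val) == (relation, attribute, value):
                    hits.add(maua)
        return hits
```

The first two levels serve only to narrow the set of IMAUs that must be scanned, which is the purpose the text gives for introducing them.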


Before leaving this section, an important operator,
the index-select, which performs the retrieval operation
on the index-term index, must be mentioned. The
index-select operator is executed in conjunction with
the relational algebra operation selection and the modification
operation. It retrieves the addresses of those
MAUs which contain at least one tuple that satisfies the
corresponding qualification expression (QE).


**As is seen later, the MAU occupies one minimum access unit
(MACU).


3.2. The Michigan Relational Database Machine Organi- 
zation 


The proposed Michigan Relational Database Machine 
(MIRDM), shown in figure 3.1, consists of four com- 
ponents, namely, the master back-end controller (MBC), 
the processing cluster subsystem (PCS), the mass 
storage subsystem (MSS) and the interconnection net- 
work subsystem (INS). In the following the organizations 
of these components will be outlined. 


3.2.1. The Master Back-End Controller


In cooperation with the front-end computer 
system(s), the master back-end controller (MBC) interfaces
the users to the database system, translates the
user queries to the primitives of MIRDM, schedules and
monitors the query execution, manages and controls the
different components of MIRDM, stores and maintains 
the system's dictionary, stores, maintains, and manipu- 
lates part of the database directory (the relation, the 
MAU master and the attributes indices) and provides for 
security checking, integrity maintenance and users’ 
views. 

The implementation of the MBC is strongly depen- 
dent on the way the above stated functions are parti- 
tioned between the front-end computer system and the 
MBC. Based on this partition, the MBC can be implemented
using a powerful mini/micro computer.


3.2.2. The Mass Storage Subsystem 


The mass storage subsystem (MSS), shown in figure 
3.1, is the repository of the database and its index-term
index. The MSS is organized as a two-level memory,
namely, the mass memory (MM) and the parallel buffer
(PB). While the MM helps MIRDM to meet the large capacity
storage requirement, the PB helps it to take advantage
of the local and sequential references to the database.
In the following, the architecture of both levels
will be outlined. 


The Mass Memory 


The mass memory is organized as a set of moving- 
head-disks, controlled and managed by the mass 
storage controller (MSC). Each disk is provided with the
capability of reading/writing from/to more than one 
track in parallel. Tracks which can be read/written in 
parallel, from one disk, form what is called the 
minimum access unit (MACU). The tuples within this 
unit are laid out on a moving-head-disk's track in a "bit
serial-word serial" fashion. The MACU is the smallest
accessible unit of data as well as the unit of data
transferable between the MM, the PB and the PCS. The
MACU in MIRDM stores only one MAU. We expect the
MACU to have the size of a moving-head-disk cylinder. 


In addition to the database relations, the MM stores 
another type of data, namely, the index terms. In general,
the index terms which are defined on attributes of
different relations can reside in the same IMAU. In
order to improve their retrieval cost, the index terms
are clustered together according to their relation and
attribute names.


An IMAU is stored in one MACU. Every track, within 
the latter unit, contains a set of blocks of suitable sizes 
(~ 2-4 K bytes). Each block contains index terms 
defined for the same relation and attribute. For storage 
as well as processing efficiency, the <relation name,
attribute name> common to all the index terms of the 
block is stored once, at the beginning of the block. The 
rest of the block stores only the <value, MAUA> part of 
the corresponding index terms. 
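

The block layout just described, with the common prefix stored once, can be sketched as follows. This is an illustration only: the dictionary representation and the function names `pack_block` and `unpack_block` are assumptions, not the on-disk format of MIRDM.

```python
def pack_block(relation, attribute, terms):
    """Store the shared <relation, attribute> once, then only <value, MAUA> pairs.

    `terms` is an iterable of (relation, attribute, value, maua) quadruples.
    """
    pairs = []
    for rel, attr, value, maua in terms:
        if (rel, attr) != (relation, attribute):
            raise ValueError("all terms in a block must share relation and attribute")
        pairs.append((value, maua))
    return {"header": (relation, attribute), "pairs": pairs}


def unpack_block(block):
    """Rebuild the full index terms from a packed block."""
    relation, attribute = block["header"]
    return [(relation, attribute, value, maua) for value, maua in block["pairs"]]
```

Because every term in a block agrees on the relation and attribute names, a scan has to compare only the value field of each stored pair.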


The Parallel Buffer


The parallel buffer (PB), shown in figure 3.1, is 
organized as a set of blocks, each of size equal to that of 
an MACU. A block is further partitioned into a set of 
subblocks. Each subblock can buffer one track of a
moving-head-disk. The PB is managed by the mass 
memory controller. 


The PB implementation can take advantage of the 
technologies of both the magnetic bubble memory and 
the charge coupled device memory. Both technologies 
currently have off-the-shelf memory chips which can 
buffer an entire disk track. 


3.2.3. The Processing Clusters Subsystem 


The processing clusters subsystem (PCS) is organ- 
ized as a multiple single instruction stream-multiple 
data stream (MSIMD) system. The PCS (figure 3.1) con- 
sists of a set of processing clusters (PCs) which share a 
common buffer, the parallel buffer. A PC, shown in figure
3.2, has a single instruction stream-multiple data
stream (SIMD) organization. A PC consists of a set of triplets,
each of the form:


< I/O controller (IOC), triplet processor (TP), local memory unit (LMU) >


The set of triplets within a PC is controlled and managed
by the cluster master processor (CMP). The latter 
accesses its triplets through a broadcast bus, the master
bus (MBUS). The MBUS permits the CMP to write the
same data to all the LMUs of its cluster triplets simultaneously.
On the other hand, the MBUS permits the
CMP to sequentially read data from any one of its triplets'
LMUs.

Within a PC, the data is moved between its triplets 
via a bus, the triplets bus (TBUS), controlled by a high 
speed DMA controller, the data mover (DM). Under 
instructions from the CMP, the DM moves data items 
between the LMUs of the cluster's triplets. The TBUS is
provided with both point-to-point as well as broadcast
capabilities. 

In general, the i-th LMU, in a PC, is accessible
directly by the CMP, through the MBUS, by the DM,
through the TBUS, and by both the i-th TP and IOC. We
expect the LMU of a triplet to have a relatively large
capacity (multiple of the size of a moving-head-disk
track) and to be implemented using RAM technology. It
is suggested that the IOC of a triplet be implemented as
a high speed DMA controller and that a TP be an off-the-shelf
microprocessor.


3.2.4. The Interconnection Network Subsystem


The interconnection network subsystem (INS) is 
designed to fulfill two basic requirements, namely, the 
ability to allow any two PCs or two moving-head-disks to 
read/write from/to any two blocks of the PB simultaneously,
and the ability to enable two or more PCs to read from the same
PB block. The latter requirement is needed to provide
the proposed machine with the capability of handling 
simultaneous processing of the database operations. 

The INS, shown in figure 3.1, is a modified version of
an interconnection network proposed by Dewitt [7]. The


network consists of a set of buses, each associated with 
one subblock and having one bit width. The subblock 
continuously broadcasts its contents over the
corresponding bus. Only one triplet in each PC as well
as one head of each moving-head-disk, in the MM, is connected
to the same bus (subblock). Thus the complexity
of the logic at the triplet, disk head or subblock inter- 
face is (1/ number of triplets in a PC) that of the one 
proposed by Dewitt. Whenever a PC (MM disk) needs
to read a given PB block, its IOCs (disk heads) need only
switch themselves to the appropriate set of data buses.
If the parallel buffer block contains a data MAU, then
the IOCs (disk heads) can begin to read at a tuple boundary.
However, for an index MAU, the IOCs (disk heads)
proceed to read it at an index block boundary. Whenever
a PC (MM disk) needs to write to a given parallel
buffer block, its IOCs (disk heads) need only switch
themselves to the appropriate set of data buses. The
writing then follows immediately. Notice that the MSC
(figure 3.1) is responsible for preventing any two PCs, or 
disks, or a PC and a disk from writing to the same paral- 
lel buffer block. 


3.3. Algorithms for the Relational Algebra and Index 
Retrieval Operators 


The newly proposed MIRDM supports the parallel
processing of the most important relational database
operators, namely, select, project and θ-join, as well as
the index retrieval operator index-select. In most
retrieval queries, the operator project follows the select
operator. For this as well as for reasons of performance
improvement, the newly proposed machine combines
the two operators, select and project, to form a new
operator (the select-project operator). The latter
operator is processed as a non-decomposable operator.


In MIRDM, one or more PCs are used to execute the 
select-project and the θ-join operators. On the other
hand, only one PC is used to execute the operator
index-select. In general, the number of PCs assigned to
execute a select-project or a θ-join operator is an MBC
decision. This decision is based on many factors, such
as the operator type, the size(s) of input relation(s), the
expected size of the output relation, the number of
available PCs and the priority class to which the
operator's query belongs. By accessing the appropriate
directory, possibly with the help of a PC, the MBC determines
the set of all MAUs relevant to a given operator.


The flexibility and generality of the MIRDM architecture
permit the implementation of a powerful set of algorithms
for the above operators. These algorithms are
presented in [8]. 


4. Discussion 


In the previous section, we have presented an archi- 
tecture for a back-end database machine which is capa- 
ble of supporting the on-line, concurrent, very large 
relational database systems. Our approach rests on a 
set of fundamental design principles. This set includes 
two principles followed by previously designed DBMs, 
namely, the MIMD organization of DIRECT [7] and the 
"page level indexing" of DBC [6]. While the MIMD organization
is very valuable in handling the concurrent user
environment, the "page level indexing" is equally impor- 
tant in supporting the very large database environment 
as well as in the reduction of the system data volume 
needed to be moved for the selection/modification 
operation. In contrast to the DBC, MIRDM stores the
database structure information on the relatively
inexpensive mass storage devices (on moving-head-disks
rather than the much more expensive electronic
ones), manipulates this structure information using the
same units (the processing clusters) which manipulate
the database (thus distributing the system's workload
uniformly among its various components) and provides
the machine with the MIMD capability as well as the
additional parallelism and processing power which are
essential for meeting the requirements of the contemporary
and anticipated database systems. Finally, our
proposed architecture removes the restriction imposed
by the DBC on processing the θ-join operation [8],
namely, that both the source and target relations of the
θ-join operation must fit in the local memory of a
processor-memory complex, designed specifically to
carry out the θ-join operation. The newly proposed
machine has the capability of joining relations of any
size.


In contrast to DIRECT, MIRDM groups the processing 
elements into a set of clusters, each cluster with its own 
controlling processor. The data transfer, in the new 
machine, is done in relatively large units (the MACUs). 
This organization not only improves the management 
and control of the processing elements and distributes 
the overhead caused by the processing of requests for 
the movement of the data units (this overhead is causing
a system bottleneck in DIRECT), but it also reduces
the complexity of the interconnection network. Providing
the newly proposed machine with "page level index"
as well as supporting its primitive operations, at the
hardware level, helps to improve the performance of the 
selection operation. The latter operation is performed 
poorly, on DIRECT, relative to other DBMs [9]. In the 
case of very large databases, our machine permits the 
implementation of a set of algorithms for the equi-join 
operation, more powerful than the one implemented in 
DIRECT. To demonstrate that fact, we have compared
[10] the performance of the equi-join algorithms recommended
for the newly proposed machine with that
adopted by DIRECT. These algorithms have been
selected [3] from a large set of algorithms, based primarily
on the behavior of one performance measure, the
equi-join "total execution time". This comparison shows
that, in a typical very large database environment,
MIRDM is 1.5 to 5 times faster in executing the equi-join
operation than DIRECT.


5. Conclusion 


In this paper, we have proposed a back-end database
machine suitable for supporting the concurrent,
on-line, very large relational database systems. The new
machine is designed to satisfy a set of guidelines. These 
guidelines have been formulated based on reviewing the 
previously proposed DBMs, the current and the near 
future state of technology and the requirements of the 
concurrent very large relational database systems. The 
previously proposed DBMs were reviewed using a novel 
scheme for their classification. 

At the present, our research activities are centered 
around prototyping the new machine. Currently a pro- 
cessing cluster with 8 triplets is being implemented at 
the Computing Research Laboratory of the University of 
Michigan, Ann Arbor.


EFFICIENT COMPUTING OF RELATIONAL JOIN OPERATIONS
BY MEANS OF SPECIALIZED HARDWARE (a)


Yang-Chang Hong 
Department of Mathematics 
University of California 

Riverside, CA. 92521 


Abstract -- A hardware architecture is presen- 
ted which can provide powerful join capabilities to 
associative processing (AP) systems. The main fea- 
ture of the hardware is a bit- and word-addressable 
store which can rapidly remember or recall data. 

The data might be the values or tuples selected
from one relation, in which case the store helps to
perform the joining of these values or tuples with
the tuples in the second relation. For the general
case of the join, the store can help in dividing
the tuples of the relations being joined into blocks,
according to their join column values. The conca- 
tenation of tuples in the corresponding blocks is 
then done by an array of servers. This hardware 
design emphasizes parallelism in the cross referenc- 
ing between tuples of the relations being joined, 
giving considerable performance improvement over 
existing AP systems. The paper finally gives an 
analysis of the results of hardware performance un- 
der different applications. 


Introduction 
Limitations of AP Systems 


Previous designs of associative processing 
hardware [1,2,3,5,6,8,9,11] for relational data- 
bases have concentrated on searching through data 
contained within a table. One form of search involving
more than one table has also been examined extensively.
This form, called the "implicit" join [1,3,8,
9,11], does not create a derived relation; instead, 
values selected from one relation are transferred 
to select the tuples in the second (or original) 
relation that have the same values in their join 
columns (i.e., columns on which the joining is 
based). The joining of tuples of relations (re- 
ferred to as the explicit join) is, however, carried 
out primarily by the MFC, to which the AP system is 
a backend. It will not be very effective if the 
number of tuples to be joined is large. 

The AP systems were based on the parallel processing
of a segmented sequential search. Because
the join operation generally requires a great deal
of cross checking between tuples of the relations
being joined (which in turn results in a breakdown
of parallelism), such a search is not, in itself, sufficient to
make a high performance database machine, especially
when a join-dominated database application is involved.
Separate hardware which can perform a
large amount of cross referencing in parallel must
be sought. It is the purpose of this paper to augment
an AP system with hardware that will aid in
the join operation.


(a) The work was supported in part by the National
Science Council, Taipei, Grant #NSC70-0404-E001-02.
The author is currently visiting the Department
of Electrical Engrg. & Computer Science,
Univ. of Santa Clara, Santa Clara, CA 95053.


0190-3918/83/0000/0315$01.00 © 1983 IEEE


Approach 


The main feature of the hardware is
a bit- and word-addressable store which can rapidly
remember or recall data. The data might be the
values or tuples selected from the first relation 
being joined. They are then used to select the 
tuples in the second relation. For the general 
case of join, the store can help to divide the 
tuples of the relations being joined into blocks 
of tuples according to their join column values. 
The concatenation of tuples in the corresponding 
blocks is then done by an array of servers. The 
approach emphasizes parallelism in the cross re- 
ferencing between tuples of the relations being 
joined, providing a considerable performance im- 
provement over existing AP systems. A hardware 
simulator was developed on the PDP-11/70 computer
for determining hardware parameters - the number 
of servers and their associated queue length. It 
also had the ability, given a fixed number of ser- 
vers, to tell the performance of the hardware un- 
der different applications. 


Organization of the paper 


The body of the paper is divided into three
parts. In the first part, the hardware architecture
is described. The second part is concerned
with the computing of relational joins on the pro- 
posed architecture. The third part is concerned 
with the performance analysis of the architecture, 
which is followed by a summary and conclusion. 


Hardware Architecture 


The architecture depicted in Figure 1 suggests 
that the selection of column values or tuples in a 
relation is done by the AP, while the joining of 
the tuples is performed by the extended hardware. 
The command and control processor (CCP) receives 
data requests from the MFC, translates them into 
commands, distributes those commands to the AP and 
extended hardware for execution, receives and for- 
mats the resulting data, and outputs the formatted 
data to the MFC. Like CASSM and RAP [9,11], we 
assume that data in the AP are stored in encoded 
form and that the encoding and decoding processes 
are done by the CCP (i.e., E.D.U.). 

The extended hardware consists of five major 
parts: IP, MB, RAM, S and CP. They are described 
as follows: 


(1) IP is an input processor which serves as
a buffer between the AP and the extended hardware. It accepts
the column values or tuples selected from the
AP and stores them into queue Q. The registers H
and T are used to hold the locations of the first
and last entries of Q, respectively. The flag F_Q,
when set to 1, indicates Q is full. The IP starts
its operation whenever Q is not empty.


(2) MB is a memory bank for storing the tuples
of the first relation being joined. It is divided
into p modules, designated as M(i), 0 ≤ i ≤ p-1.
Each module has q words. p and q are design parameters.


(3) RAM consists of single bit array stores 
(rA, rB, etc.) and an array r of words. The bits 
of the store and r-words are addressed by encoded 
values. The addressed bit can be set to 1 or 0 
and can be tested for being 1 or 0. The addressed 
r-word can hold an encoded value or a pointer pointing
to a particular word of a particular module, or
its contents can be fetched for various joining purposes.


(4) S is a set of servers which forms new tuples
of the join. Associated with each server S_i is a
queue Q_i, 0 ≤ i ≤ p-1, for holding tuples of the
second relation being joined. Like queue Q, each
Q_i has two registers, T_i and H_i, and a flag F_i.
Each S_i is designed to form new tuples from Q_i and
M(i), without any memory conflicts. Thus, there
are as many S_i's as M(i)'s. A buffer is provided
for each server to hold the new tuples it produces.
The new tuples can then be either output to the MFC 
or stored back to the AP for further processing. 
This is accomplished by the output mechanism. 


(5) CP is a central processor which reads data
stored in Q, addresses the RAM using encoded values
as indices, allocates the storage space in MB for
the tuples of the first relation being joined, and
deposits the tuples of the second relation into the
appropriate server queues. Registers D, T, and
BR(i), 0 ≤ i ≤ p-1, are provided for allocating
storage space for storing the tuples of the first
relation being joined.


Computing of Join Operations 


This section describes the implementation al- 
gorithms on the extended hardware. As suggested 
above, the joining of relational tuples is done by 
the extended hardware. In our implementation, im- 
plicit and explicit joins are treated in different 
ways. We will first discuss the implementation of 
implicit joins and then discuss the explicit joins. 


Queries Involving the Implicit Joins 


We use an example to illustrate how the single
bit array store is used to implement implicit joins.


Example 1. Print all the green items sold by the D1
department.


To answer this query, a simplified database
with tables SALES and TYPE is assumed in Figure 2.
The query can be implemented in various ways. One
way is to apply the selection process to SALES to
select the items sold by D1. The selected items
are then used as a disjunctive condition to match
the TYPE tuples. The green items of the matched
tuples are output to the MFC. The procedure implemented
by the single bit array store rA is outlined
below:


(1) Reset rA. 


(2) Scan table SALES by the AP and output the
items sold by D1 to the extended hardware, specifically
to the queue Q of the IP. The items are
fetched and used as indices to set the appropriate
rA bits to 1. After this step, all the items sold
by D1 are recorded in rA, each having one corresponding
set bit in rA.


(3) Scan table TYPE and output all the green
items to Q. These items are then checked against
items sold by D1. This is done by using the items
in Q as indices to address the corresponding rA
bits and outputting the items whose corresponding
rA bits are set to 1. (We neglect the encoding and
decoding processes here.)


The discussion above assumes that all the en- 
coded ITEM values are within the address space of 
rA. If not, the procedure is repeatedly applied. 
The ith (i>1) repetition is applied to the value 
range between 2t+1-1 and (2t+i-1), where t is the 
number of bits required for the address space rA. 

If the values selected from the second relation
are again transferred to match tuples in the
third relation, another single bit array store is
required. In general, two stores are sufficient,
which can be used alternately for a query involving
a chain of implicit joins.
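

The three-step procedure for Example 1 can be sketched in Python as follows. This is a software illustration under stated assumptions, not the hardware design: the table layouts SALES(DEPT, ITEM) and TYPE(ITEM, COLOR) are assumed because Figure 2 is not reproduced here, item values are taken to be already encoded as integers that fit in the address space of rA, and the function and parameter names are hypothetical.

```python
def implicit_join(sales, types, dept, color, address_bits=8):
    """Return the items of the given color sold by the given department.

    sales: iterable of (dept, item) pairs; types: iterable of (item, color)
    pairs. Items are encoded integers in the range 0 .. 2**address_bits - 1.
    """
    # Step 1: reset rA (one byte per bit is used here for readability).
    r_a = bytearray(1 << address_bits)

    # Step 2: scan SALES and set the rA bit addressed by each selected item.
    for d, item in sales:
        if d == dept:
            r_a[item] = 1

    # Step 3: scan TYPE and output the selected items whose rA bit is set.
    result = []
    for item, c in types:
        if c == color and r_a[item]:
            result.append(item)
    return result
```

For example, `implicit_join([("D1", 3), ("D2", 5), ("D1", 7)], [(3, "green"), (5, "green"), (7, "red")], "D1", "green")` returns `[3]`. Each relation is scanned once, and membership is tested by direct addressing rather than by comparing values.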


Implementation of Explicit Joins 


Several different approaches to this type of join
have been proposed [4,7,10,12]. One
way is to sort the tuples of each relation being 
joined into blocks of tuples based on the join 
column values. The tuples in a block have the same 
join column value. Each tuple in one block is then 
concatenated with the tuples in the corresponding 
block of the second relation. The concatenation is 
rather straightforward. 

Our approach is very similar to this approach.
However, it does not actually perform the sorting.
Instead, tuples in one relation are first divided
into blocks of tuples, according to their join
column values, and then are stored in the MB (see
Figure 3). No block is allowed to be stored in
more than one module. The information about the
location of each block of tuples is stored in RAM.
In Figure 3, a block of 3 tuples with join column
value B1 is stored at the location 100 of module 0,
i.e., M(0). The location information of this block
is stored in an r-word addressed by B1 (assume B1
is encoded as 1). The setting of the corresponding
rA bit to 1 indicates that there is a block of
tuples with join column value B1 in the first relation.

After the first relation is stored in MB and
the location information is entered into RAM, the
AP system starts outputting the tuples of the second
relation to the CP. The join column value of
each incoming tuple of the second relation is extracted
and used as an index to address the RAM. If
the addressed rA bit is 0, then the tuple is discarded
because there is no match. If set, the module
number of the addressed r-word determines the
queue in which the incoming tuple will be deposited.
The tuple is concatenated to the location number of
the addressed r-word so that, when the tuple is
processed, the server would know the location of
the block the tuple will be concatenated to. The
arrangement described above permits the array of
servers to produce the concatenated tuples of the
join from their queues and the corresponding modules,
without any memory addressing contention.

The algorithm assumes that MB is large enough 
to hold an entire relation to be joined and the 
block size is less than or equal to the module 
size. If not, additional effort is needed. 
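

The explicit join procedure can be sketched in Python as below, under the same assumption that MB holds the whole first relation. Several details are assumptions made for illustration and are not taken from the paper: blocks are placed in modules round-robin, each r-word also records the block length, presence of a key in the `r_word` dictionary plays the role of a set rA bit, and the servers are run one after another rather than in parallel.

```python
from collections import deque


def explicit_join(first, second, key_first, key_second, p=4):
    """Join two lists of tuples on first[key_first] == second[key_second]."""
    # Divide the first relation into blocks, one per join column value.
    blocks = {}
    for t in first:
        blocks.setdefault(t[key_first], []).append(t)

    # Store each block in exactly one module and record its location.
    modules = [[] for _ in range(p)]   # MB: modules M(0) .. M(p-1)
    r_word = {}                        # join value -> (module, start, length)
    for n, (value, block) in enumerate(blocks.items()):
        m = n % p                      # assumed placement policy
        r_word[value] = (m, len(modules[m]), len(block))
        modules[m].extend(block)

    # CP: route each tuple of the second relation to the queue of the
    # server that owns the matching block, tagged with the block location.
    queues = [deque() for _ in range(p)]
    for t in second:
        loc = r_word.get(t[key_second])
        if loc is None:
            continue                   # rA bit is 0: no match, discard
        m, start, length = loc
        queues[m].append((t, start, length))

    # Servers: S_i reads only Q_i and M(i), so no two servers ever touch
    # the same module. They are run sequentially in this sketch.
    result = []
    for m in range(p):
        while queues[m]:
            t, start, length = queues[m].popleft()
            for s in modules[m][start:start + length]:
                result.append(s + t)
    return result
```

Because a block never spans modules, every queued tuple needs data from a single module only, which is what lets the servers work without memory contention.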


Analysis 


The analysis has concentrated on how the number
of servers and the length of the server queues
affect the hardware performance in computing the
explicit join operation. It is divided into two
aspects. One is to determine, given an application,
the number of servers required and the length of
their associated queues so that tuples deposited
to the array of servers will not be blocked (theoretically).
The other is to determine, given a fixed number of
servers, how the hardware performs under different
applications. The application can be characterized
in terms of many factors. In our analysis, it is
defined by the ratio of the number of tuples in a
relation to the number of distinct join column
values and will be referred to as the "multiplicity".
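

As a concrete reading of this definition, the multiplicity of a relation can be computed as follows. The function name and the tuple representation are assumptions made for illustration.

```python
def multiplicity(relation, key):
    """Ratio of the number of tuples to the number of distinct join column values."""
    values = [t[key] for t in relation]
    if not values:
        raise ValueError("multiplicity is undefined for an empty relation")
    return len(values) / len(set(values))
```

A relation of four tuples whose join column holds only two distinct values has multiplicity 2.0, so on average each block formed for the explicit join contains two tuples.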

The analysis is based on a hardware simulator 
developed on the PDP-11/70 which simulates the 
functions of the proposed hardware. The applica- 
tions are generated using a random number genera- 
tor. 

Our analysis results are stated below. The 
number of servers required for each application 


Nets (multiplicity) ?’>. If each application runs 
on its required number of servers, the analysis 
indicates that only a few units of tuple are required
for each server queue to achieve good performance.
If the number of servers is fixed, then
the performance degrades logarithmically if the
multiplicity of an application is greater than that
of the application running on that number of servers.


Summary and Conclusion 


A hardware architecture which can efficiently
compute the relational join operation has been
described. The main feature of this architecture
is a RAM which can rapidly remember or recall data
for computing implicit joins. It can also help to divide
the tuples of the relations into blocks of tuples
for computing explicit joins. The analysis results
show that the number of servers required for a tuple
deposited to the array without being blocked (theoretically)
is a function of multiplicity, independent
of the cardinality of the resulting relation.


This hardware provides powerful join capabilities
to AP systems, especially when applied to join-dominated
database applications. We believe that
it can be adapted to the current VLSI technology.


A VLSI MODULAR ARCHITECTURE METHODOLOGY
FOR REALTIME SIGNAL PROCESSING APPLICATIONS 


Hungwen Li 
RCA Advanced Technology Laboratories 
Camden, NJ 08102 


Abstract -- A computer system architecture is currently
under development for realtime processing applications
having widely varying throughput requirements.
Modularity at the VLSI chip level is therefore
the highest priority attribute in the design
process. In support of this modularity concept, two
VLSI primitives, one for computing, the other for
interconnecting, were defined for use as "building
blocks" to customize the amount of hardware
required. To maximize VLSI modularity, a hardware
system architecture was established incorporating
the two VLSI primitives mentioned above and an
existing microprocessor for two groupings of hierarchies,
cluster and system. The software system
architecture adheres to the dataflow concept. Synchronized
at the functional level, the signal
processing operations that constitute the software
are data-driven and data-independent. The software
architecture was developed orthogonal to the
hardware system architecture to ensure independence
and transportability. In addition, the orthogonality
between the hardware and software system
architectures provides the flexibility to adjust
the hardware/software mixture as required to the
application without major redesign of the system
components. A system architecture simulator, to
evaluate various candidate system architectures in
response to a set of system parameters and constraints,
is being implemented to aid in the tradeoff
of the hardware/software mixture.


Introduction


The development of computer system organization
(system architecture) for future realtime
signal processing applications including radar,
sonar, telecommunications, and speech [1] is underway.
Modularity at the VLSI chip level is the
chief criterion and the hierarchical use of two
VLSI modules in the system architecture appears
highly effective logistically.


The higher-level attributes of the signal
processing system vary from application to application.
Obviously, the system requirements of a
radar system to simultaneously track 400 targets
are far removed from the specifications of a
speech terminal providing secure voice transmission;
the algorithm for detecting a target via sonar
signal processing bears no resemblance to the modulation
algorithm of the telecommunication signal
processing.


However, the lower-level attributes of such
systems, upon which the higher-level attributes are
built, are functionally similar over a wide range
of applications and are well defined. On this
basis, a set of VLSI primitives and a unified
system architecture can be defined for a host of
applications, each with differing performance
requirements. Such primitives and architecture
will provide a sound foundation on lower-level
attributes and allow the system designer to focus
on the design of the higher-level layer efficiently.


Two VLSI primitives were identified to 
facilitate the developing of lower-level attri- 
butes. The first, a Programmable Signal Processor 
(PSP) [2], supports versatile vector operations 
which provide a structure for various higher-level 
algorithms. The second, a Signal Processor 
Interconnection Switch (SPIS) [3], offers a full 
connection among participating multiple PSPs, 
allowing algorithms to be synthesized for applica- 
tions having different performance requirements. 
The architecture and characteristics of the VLSI 
primitives will be discussed in the next section. 


The hardware system architecture relies heavily 
on the SPIS interconnection mechanism. Using the 
SPIS according to the cluster and system hierarchy, 
which is described under the Hardware System 
Architecture section, maximizes the VLSI modularity. 
The software system architecture follows the data 
flow concept [4,5] at the functional level and was 
developed separately from the hardware system 
architecture to ensure independence and transporta- 
bility. Orthogonality between the hardware and 
software system architectures additionally provides 
the flexibility to adjust the hardware/software 
mixture as required by the applications without
major redesign of the system primitives. To aid 
the tradeoff of hardware/software mixture, a 
system architecture simulator (SARSIM) is being 
implemented to evaluate various candidate system 
architectures in response to various system param- 
eters and constraints. 


VLSI Primitives 


To fulfill the computational requirements of 
signal processing on an incremental basis, a high 
throughput programmable signal processor with mod- 
ularity at the VLSI chip level is essential; fur- 
thermore, a unified interconnection mechanism that 
constructs a multiprocessor system and can be 
implemented preserving VLSI modularity is very 
desirable. This concept identified and defined 
two VLSI primitives, the PSP and SPIS. 


Programmable Signal Processor (PSP) 


The PSP is a two-chip device consisting of the 
controller and the Register Arithmetic Logic Unit 
(RALU). It is designed for the 132-pin package and
1.25 µm CMOS/SOS technology. A clock rate higher
than 50 MHz, operating at 5 volts, is required.
Figure 1 illustrates the major PSP components 
and their interfaces to the program memory, the 
data memory, and the control unit which will be 
described in the section on system architecture.


Fig. 1. Block Diagram of Programmable Signal
Processor.


RALU. The RALU has a horizontally microprogram-
mable, pipelined architecture which performs com-
putation-intensive signal processing functions.
As shown in Fig. 1, the data path of RALU is or-
ganized as a multiplier/dual-adder structure to
optimize the Fast Fourier Transform (FFT) compu-
tation as well as popular signal processing al-
gorithms such as filtering and convolution.


Major components of the RALU include eight
16-bit registers, a 16-bit-by-16-bit multiplier,
and two ALUs. Separate input and output lines
support simultaneous memory read and write. Con-
trolled by a 16-bit µ-instruction, all activities
in these components occur concurrently, yielding
a throughput of more than 100 million operations
per second (MOPS).


A list of operations directly supported by
RALU is given in Table 1. These operations are
supported in 16-bit integer, 32-bit complex, 16- 
bit block floating point, and 32-bit complex 
block floating point formats. 
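As an illustration of the block floating point formats named above, the following Python sketch uses the usual textbook definition, in which a block of integer mantissas shares one exponent. The function names, the rounding, and the clamping are assumptions made for illustration; the paper does not specify the PSP's exact number format.

```python
import math


def to_block_float(values, mantissa_bits=16):
    """Encode a block as integer mantissas that share one exponent.

    Textbook block floating point; not the PSP's documented format.
    """
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0] * len(values), 0
    limit = 2 ** (mantissa_bits - 1) - 1
    # frexp gives peak = m * 2**e with 0.5 <= m < 1, so this exponent
    # scales the largest element into the signed mantissa range.
    exponent = math.frexp(peak)[1] - (mantissa_bits - 1)
    mantissas = [
        max(-limit, min(limit, round(math.ldexp(v, -exponent))))
        for v in values
    ]
    return mantissas, exponent


def from_block_float(mantissas, exponent):
    """Decode mantissas and the shared exponent back to floats."""
    return [math.ldexp(m, exponent) for m in mantissas]
```

The point of the format is that only one exponent is stored and updated per vector, so the arithmetic on the elements themselves stays fixed-point.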


TABLE 1. FUNCTIONS SUPPORTED BY PSP


Function

FFT                 CONVOLUTION
MULTIPLY/ACCUM      INTEGRATE
ADD/SUB             POLYNOMIAL
DIVIDE              POISSON
SQUARE ROOT         GAUSSIAN
LOGICAL             MIN
SUM OF ABSOLUTE     AMDF
CONJUGATE           MAX
FILTERING           THRESHOLD


The VLSI modularity is further enhanced by
the RALU. For example, one RALU can execute the
FFT butterfly in 4 clock cycles. However, two
RALUs can be arranged side-by-side to execute a
2-cycle butterfly; more significantly, four RALUs
can be arranged in parallel to accomplish a
1-cycle butterfly for applications requiring a very
high throughput.
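The scaling just described (4, 2, and 1 cycles per butterfly for one, two, and four RALUs) can be turned into a rough cycle estimate. The Python sketch below assumes a radix-2 FFT with (N/2) log2 N butterflies and ignores pipeline fill, memory access, and control overhead; the function names are illustrative and do not come from the paper.

```python
BASE_CYCLES = 4  # cycles for one RALU to execute one butterfly (from the text)


def butterfly(a, b, w):
    """Radix-2 butterfly: returns (a + w*b, a - w*b)."""
    t = w * b
    return a + t, a - t


def cycles_per_butterfly(n_ralus):
    """Cycles per butterfly for the three arrangements named in the text."""
    if n_ralus not in (1, 2, 4):
        raise ValueError("the text describes 1, 2, or 4 RALUs only")
    return BASE_CYCLES // n_ralus


def fft_cycles(n_points, n_ralus):
    """Butterfly cycles for a radix-2 FFT (assumed model, overheads ignored)."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two")
    stages = n_points.bit_length() - 1
    butterflies = (n_points // 2) * stages
    return butterflies * cycles_per_butterfly(n_ralus)
```

Under these assumptions a 1024-point FFT needs 5120 butterflies, so the estimate drops from 20480 cycles with one RALU to 5120 cycles with four.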


Controller. The PSP controller supports all 
the control mechanisms required by the RALU. It 
handles the program sequencing, the data memory 
address generating and control, and the interface 
to the control unit. 


The controller contains three components: 
The SEQuencer (SEQ), two Data Address Generators 
(DAGs), and the system control. The sequencer 
generates one program address per cycle to pre- 
fetch one instruction for controlling the DAGs, 
the RALU, and the remainder of the system. The 
sequencer also handles the command buffer, the 
interface to the control unit for passing the 
signal processing descriptors (described in the 
system architecture section). The DAG generates 
one data memory address per cycle to read or write 
memory data. Along with the address, the timing 
and the control signals are generated on-chip to 
simplify the system interface circuit. 


The sequencer supports immediate, register, 
and relative addressing modes and convenient 
branching functions such as IF and CASE desired 
by the program control. Two stacks, the itera- 
tion stack and the program counter stack, coopera- 
tively achieve FOR and WHILE looping actions 
described in Ada. A special LOOP action allows a
program to operate continuously without resetting
the iteration count. This feature is particularly
useful in a clock-driven, high data-rate environ- 
ment. 


Flexible addressing modes and a wraparound
mechanism in the DAG support circular buffers
and corner-turning of two-dimensional memories.
In this way a wide spectrum of data memory
accessing patterns (such as bit reversal for the
FFT and window movement) generic to signal
processing applications can be managed.
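The access patterns mentioned above can be written down as address sequences. The Python sketch below shows one plausible software rendering of three of them (bit reversal, circular-buffer wraparound, and corner turning); the function names and parameters are hypothetical, and the DAG hardware itself is not described at this level in the paper.

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of `index` (FFT input/output ordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def circular_addresses(base, length, start, count):
    """Addresses for `count` accesses that wrap around a circular buffer."""
    return [base + (start + i) % length for i in range(count)]


def corner_turn_addresses(rows, cols):
    """Walk a row-major matrix column by column (corner turning)."""
    return [r * cols + c for c in range(cols) for r in range(rows)]
```

For example, `[bit_reverse(i, 3) for i in range(8)]` gives the familiar 8-point ordering 0, 4, 2, 6, 1, 5, 3, 7.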


Signal Processor Interconnection Switch (SPIS) 


The most challenging problem faced in imple- 
menting VLSI modularity when a large number of 
PSPs are interconnected is overcoming the pin 
limitation of the package. Recognizing this, the 
SPIS separates the passing of data from the con- 
trols so that the data path can be bit-sliced, 
which allows more PSPs to be connected in one 
module. 


Figure 2 depicts the data path of SPIS. Each
signal processor is equipped with a local memory
and is allowed to have its own port attached to
the shared memory via SPIS. The shared memory is
organized as a B-word-wide memory with word size
W (where B is the block size and is the smallest
addressable unit of the shared memory). The block
size B is programmable and is subject to the num-
ber of ports (N) and the relative speed between
the shared memory and the local memory accessing.


Fig. 2. Schematic Diagram of SPIS Data Path.


A control unit (not shown) is dedicated to
control the SPIS operation via a control bus
(CBUS) to which every signal processor attaches
and sends commands. The control unit accepts
commands in a round-robin fashion, with a fixed
time slot allocated to each signal processor.
During each time slot, the control unit may
perform one of the three operations as summarized
in Table 2. These operations are performed by
sending control signals from the control unit to
the shared memory unit, SPIS chips, and each sig-
nal processor.


TABLE 2. BASIC COMMAND TYPES

Operation   Source Operand   Destination Operand   Iteration Count
READ        SMU-BLOCK #      PORT #                # OF BLOCKS
WRITE       PORT #           SMU-BLOCK #           # OF BLOCKS
COPY        PORT #           PORT #                # OF BLOCKS


In any operation, a parallel transfer (B
words) occurs at the shared memory end while
serial transfer (one word) occurs at the local
memory end. This requires one bit storage per
crosspoint and one vertical connection per column
(Fig. 2), which enhances the crossbar switch and
increases the complexity of SPIS accordingly. A
first-cut logic design indicated that a 32 x 32
SPIS chip is at the complexity of about 60K trans-
istors.


The bit-sliced SPIS architecture overcame the
pin limitation of the VLSI package and made it
feasible that at least a 32 x 32 configuration
can be supported with today's technology (<100K
transistors and 132-pin package). With a large
number of interconnections supported per chip,
off-chip switching delay is reduced. This
leads to a higher data-transfer rate, which con-
tributes significantly to the performance of the
data flow software architecture (discussed in the
next section). In addition to the above-mentioned
advantages, the bit architecture matches ideally to
the bit-organized memory, leading to easier imple-
mentation of error-correcting codes which ensures
higher reliability.


The major drawback of the SPIS architecture
is that any crosspoint failure will potentially
affect the overall system because of its bit-
sliced structure and full connectivity. This,
however, can be easily corrected by having redund-
ant SPISs for each bit path.


System Architecture


The system architecture development is divided
into hardware and software architecture, both of
which are developed orthogonally to provide not
only independence and transportability, but also
the flexibility to adjust the hardware support as
required by the application or as restricted by the
physical or cost constraints.


Software System Architecture


The software system architecture follows the
concept of data flow because signal processing
functions are basically data-driven and data-inde-
pendent. Unlike the proposed data flow concept [4],
which synchronizes the operations at the arithmetic
level, the data flow advocated here for signal
processing synchronizes the operations at the
functional level. This was chosen because most
signal processing functions operate on vectors of
reasonably large size. Consequently, functional-
level data flow reduces the overhead of passing
both the data and the descriptors.


A detailed software system architecture has
been defined [6]. Due to its complexity, only the
portions sufficient to address the methodology are
presented here. Similar work [7] has been pre-
sented recently.


Data Flow Model. As shown in Fig. 3, a signal
processing function is represented by a node and
its Input-object (I-object) and Output-object (O-
object). These objects are represented by the
direct links going in and out of the node. The
Input-objects and Output-objects can be data or
controls. A node is initially in the "wait" state
and can be converted into the "executable" state
only when all its associated data I-objects have
arrived and control I-objects are in a "true" state.


Fig. 3. Data Flow Model.


Data Flow Graph (DFG). Based on the flow 
model, an algorithm or application can be typified 
by a data flow graph consisting of a set of nodes 
representing the processing elements of the graph, 
a set of links or queues representing the directed 
information flow through the graph, and a command 
program carrying out control functions. Graph 
input queues provide a means of transporting 
data into the graph; graph output queues provide 
a means of transporting data out of the graph;
and command programs interface to the data
processing subsystem.


Each node in the graph represents a specific 
signal processing operation called the underlying 
operation of the node. The underlying operation 
may be either a subgraph or a predefined primi- 
tive operation (macro). Subgraphs allow hier- 
archical structuring in graph definition. The 
expansion is complete when a graph contains only 
nodes with macros. 
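The hierarchical expansion described above can be sketched as a recursive substitution that stops when only macros remain. In the Python sketch below, the dictionary representation and the example node names are hypothetical, and the links between nodes are left out to keep the idea visible.

```python
def expand(node, subgraphs):
    """Return the macro nodes obtained by fully expanding `node`.

    `subgraphs` maps a subgraph name to the list of nodes it contains;
    any name that is not a key is treated as a predefined macro.
    """
    if node not in subgraphs:
        return [node]
    macros = []
    for child in subgraphs[node]:
        macros.extend(expand(child, subgraphs))
    return macros


# Hypothetical two-level graph definition, for illustration only.
EXAMPLE_SUBGRAPHS = {
    "channel": ["compress", "detect"],
    "compress": ["FFT", "MULTIPLY/ACCUM", "FFT"],
    "detect": ["FILTERING", "THRESHOLD"],
}
```

Calling `expand("channel", EXAMPLE_SUBGRAPHS)` yields a flat list of macros, which corresponds to the point at which the text calls the expansion complete.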


A node has a set of input data queues supply- 
ing data to the node. Associated with each data 
input queue are: 

a) A threshold value representing the mini- 
mum number of data elements that must be 
present on the corresponding queue before 
the macro can be executed. 

b) A read amount representing the number of
data points that the macro will use as
input data.

c) A consume amount representing the actual 
number of data points to be removed from 
the queue after the macro has been 
executed. 

These queue parameters are managed by a data flow 
schedule algorithm to maintain the DFG execution. 
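The three queue parameters can be illustrated with a small Python sketch. It assumes a single input queue and a macro that is an ordinary function; the class and function names are hypothetical, and the check that the read and consume amounts do not exceed the threshold is an assumption added so the sketch cannot underflow the queue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InputQueue:
    threshold: int  # minimum elements present before the macro may run
    read: int       # elements handed to the macro as input
    consume: int    # elements removed after the macro has run
    items: deque = field(default_factory=deque)

    def __post_init__(self):
        # Assumption for this sketch, not stated in the paper.
        if self.read > self.threshold or self.consume > self.threshold:
            raise ValueError("read and consume must not exceed threshold")

    def ready(self):
        return len(self.items) >= self.threshold

    def take(self):
        window = list(self.items)[: self.read]
        for _ in range(self.consume):
            self.items.popleft()
        return window


def run_node(queue, macro):
    """Fire the macro repeatedly for as long as the queue threshold is met."""
    results = []
    while queue.ready():
        results.append(macro(queue.take()))
    return results
```

With a threshold and read amount of 4 and a consume amount of 2, successive firings see overlapping windows, which is the kind of window movement a filtering macro would need.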


A link or queue of a DFG represents the 
directed flow of information from node to node 
within a graph or from a node to another graph. 
There are two types of queues: data queues
carrying data and control queues for synchro-
nization. A command program, which carries
out control functions in an application and 
serves as the interface between signal and 
data processing, can be associated with a graph. 
Automatic application partitioning into a 
DFG is a difficult issue and is currently under 
study. Even with the aid of some partitioning 
programs, it is believed that manual partition- 
ing will still play the most important role in 
constructing a DFG. 


Object Database, Data Flow Schedule Algorithm,
and Synchronization. The DFG contains two types
of information: the topology which illustrates the
input/output relationship between nodes, and the
descriptor which contains the node name and the
queue parameters. Both topology and descriptor
information must be described in Signal Process-
ing Language (SPL) [6] and translated into the
object database for the graph execution. An ex-
ample of the object database can be found in
Fig. 4.


The descriptor includes the name of the node 
and the queue description for each link. Each 
queue descriptor includes queue type (control or 
data), input/output type, capacity, consume 
amount, produce amount, input/output pointer, etc. 


The schedule algorithm utilizes the object 
database to execute the data flow model. It is 
this mechanism that synchronizes the nodes in a 
DFG. Each node has an indication of its state 
(wait, executable, processing, or finished) in 
the object database. Some hardware entities
examine these states; change them from "wait" to
"executable"; and move the descriptor of the
executable nodes to the PSP for execution. The
synchronization is automatically established by
the data flow model and the object database. All
nodes are executed asynchronously; however, maxi-
mum parallelism is allowed when sufficient numbers
of processes are available.


Fig. 4. Topology and Object Database Generation.


Deadlock. A deadlock situation is possible 
when there exists a feedback path in a DFG; how- 
ever, it can be prevented by creating a control 
object along with the feedback path. The feed- 
back control is initially "true"; it becomes 
"false" after the execution of the node that 
accepts the feedback path. This feedback con- 
trol object will be triggered to be "true" via 
the execution of the node generating the feed- 
back. By constructing a logically sound DFG, 
the deadlock can be totally prevented. 
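The role of the feedback control object can be seen in a toy two-node loop. In the Python sketch below, node A accepts the feedback path and node B generates it; the loop structure, the names, and the sequential firing order are assumptions made for illustration.

```python
def run_feedback_loop(inputs, feedback_initially_true=True):
    """Simulate A -> B with a feedback control object from B back to A.

    Returns the firing trace and the inputs that were never processed.
    """
    feedback_ok = feedback_initially_true
    pending = list(inputs)
    a_to_b = []  # data queue from node A to node B
    trace = []
    progress = True
    while progress:
        progress = False
        # A needs a data input and a "true" feedback control object.
        if pending and feedback_ok:
            item = pending.pop(0)
            a_to_b.append(item)
            feedback_ok = False  # becomes "false" once A has executed
            trace.append(("A", item))
            progress = True
        # B needs only its data input; executing it re-arms the feedback.
        if a_to_b:
            item = a_to_b.pop(0)
            feedback_ok = True
            trace.append(("B", item))
            progress = True
    return trace, pending
```

When the control object starts "true", A and B alternate until the input is exhausted. When it starts "false", neither node can ever fire and all inputs remain pending, which is the deadlock the initial "true" state is meant to prevent.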


Hardware System Architecture 


Hierarchical approach and modularity are 
the highest priority factors in configuring hard- 
ware system architectures. Two levels of hier- 
archy, system and cluster, are established 
as the system architecture (Fig. 5) using 
the same SPIS primitives. 


A cluster consists of one control unit, a set
of Signal Processor Modules (SPM), and a set of
Input/Output Processors (IOP). These modules
are connected by a SPIS network, a CBUS, and a 
SBUS. The cluster control unit is attached to 
the object database and is responsible for execu- 
ting the data flow schedule algorithm described 
previously. The SPM is responsible for computa- 
tion and consists of the PSP primitive, the local 
memory, and a local control unit interfacing with 
both CBUS and SBUS. 


The IOP consists of a local memory and a 
local control unit interfacing not only with CBUS 
and SBUS but also with the Cluster CBUS and Clus- 
ter SBUS. The IOP is responsible for inputting 
data from the sensor, outputting data to the data 
processing subsystem, and transferring data among 
the clusters. The IOP can be implemented with 
existing microprocessors from which the local/
cluster/system control unit may also be con-
structed without further dedicated VLSI primitives.


Fig. 5. Hardware System Architecture.


In the hierarchical approach, several clus-
ters grouped together constitute a system. At
the system level of hierarchy, identical inter- 
connection methods and modules are adopted, with 
the exception of replacing the SPMs by the clus- 
ter block in which one or more IOPs may communi- 
cate via the Cluster SPIS network and Cluster 
CBUS. A Cluster SBUS and a system control unit 
with an object database are also available for 
the execution of the data flow. After signifi- 
cant data reduction by the signal processing 
front end, data processing is performed. The 
data processor, which also houses the command 
program, is most appropriately connected 
at the system hierarchy. 


In light of the realtime signal processing 
applications, the hierarchical architecture is 
the ideal structure for the two developed VLSI 
primitives. The hierarchical approach groups tightly-
coupled nodes into a cluster and several clusters
into a system, localizing the communication traf- 
fic and maximizing the utilization of the SPIS. 
This type of grouping is best depicted by several 
channels of identical signal processing and is a 
natural and logical way of mapping the signal 
processing problem to the hardware. Linked to 
the software architecture, the hierarchical hard- 
ware approach is almost a one-to-one mapping to 
the hierarchical expansion of the subgraph, which 
strongly indicates the high modularity of this 
methodology in both hardware and software dimen- 
sions. 


The choice of the hierarchical architecture 
is also driven by the physical limitation of the 
packaging. Using standard chassis and printed- 
circuit boards, about 32 programmable signal 
processors can be assembled in one chassis as a 
cluster. The physical interconnection of clusters 
can be conveniently done in one independent 
chassis. This simplifying approach achieves high 
modularity even at the chassis level. 


Many issues in the hardware system architec- 
ture (e.g., number of SPMs and IOPs in a cluster, 
the communication bandwidth of CBUS and SBUS, and 
the structure of the object database, etc.) remain 
undetermined. These issues are, more or less, 
application dependent and should not be totally 
predetermined until the application requirements 
are specified. To resolve these issues, a design 
automation tool at the system architecture level 
is needed and will be discussed in the next 
section. 


System Architecture Simulator (SARSIM) 


The system architecture simulator is a tool 
for evaluating the performance of a candidate 
system architecture before building the prototype 
hardware. The idea behind SARSIM is to input the 
parameterized architecture attributes (including
the hardware system architecture in terms of PMS
notation [8], the software system architecture in
terms of the data flow schedule algorithm, and
the application in terms of DFG) into the simula-
tor and to allow the system designers to observe
and collect the statistics of the interesting
parameters. SARSIM consists of six software
modules. The modules' functions are described
below.


The graph translator converts the DFG codes
in SPL to the object database for use by the 
policy handler in the execution of a data flow 
model. 


The topology handler inputs the system con- 
figuration described in PMS notation and its 
associated parameters (e.g., delay of a bus). The 
handler then builds a network of queues with cor- 
responding queue disciplines for the manipulation 
of the global clock handler. 


The policy handler module, where the data 
flow schedule algorithm resides, can implement 
a variety of algorithms following the same model. 
By observing the output of the statistic handler, 
the algorithm performance can be measured to aid 
in choosing the algorithm. 


The task handler mimics the PSP execution, 
generates the appropriate delay information to the 
statistic handler, and transmits execution status 
to the policy handler. 


The statistic handler collects and reports 
the interesting parameters such as the CBUS delay 
caused by the contention, SPIS performance as a 
function of the number of ports, and the impact of 
the object database structure on performance.


The global clock handler mimics the parallel 
events sequentially and drives the simulator. It 
examines every queue generated by the topology 
handler, services the pending requests in the 
queues, and adjusts the global timing information 
for each queue so that the statistic handler can 
perform the statistic calculation. 
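The way a global clock can replay parallel events sequentially is sketched below in Python, assuming that every queue is a single first-come-first-served server with a fixed service delay. The data layout and names are hypothetical; the paper does not describe SARSIM's internals at this level.

```python
import heapq


def simulate(requests, service_time):
    """Replay timestamped requests in global time order.

    requests: list of (arrival_time, queue_name)
    service_time: dict mapping queue_name to its fixed service delay
    Returns (queue_name, arrival, wait, finish) tuples in service order.
    """
    # The sequence number breaks ties so simultaneous arrivals keep their order.
    events = [(t, i, q) for i, (t, q) in enumerate(requests)]
    heapq.heapify(events)
    free_at = {q: 0 for q in service_time}  # time each queue's server frees up
    stats = []
    while events:
        arrival, _, q = heapq.heappop(events)
        start = max(arrival, free_at[q])  # wait if the server is still busy
        free_at[q] = start + service_time[q]
        stats.append((q, arrival, start - arrival, free_at[q]))
    return stats
```

The wait column is the kind of contention delay that a statistics module could aggregate, for instance the CBUS delay when two requests overlap.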


Status and Future Study 


Regarding the VLSI primitives, the PSP
has been defined at the register transfer level 
(RTL) and an RTL simulator has been implemented 
to validate the correctness of the architecture 
and the completeness of the instruction set by 
implementing a set of signal processing macros. 
Furthermore, the logic designs of the RALU and 
the SPIS were completed. 


The definition of the software system archi- 
tecture and the signal processing language have 
been completed and documented [6], while the de- 
tailed hardware system architecture needs to be 
investigated.


The conceptual definition of the system archi-
tecture simulator has been finished and its imple-
mentation is currently proceeding. After its com-
pletion, a series of hardware system architectures,
data flow schedule algorithms, and different ob-
ject database structures will be tested. Expected
test results are a family of performance curves
serving as the guidelines of the design space for
the hardware/software tradeoff.


The fabrication of the PSP and SPIS is planned,
from which a hardware testbed will be constructed
as a signal processing system prototype.


EMSY85 - The Erlangen Multi-Processor System for a
Broad Spectrum of Applications.


G. Fritsch, W. Kleinoeder, C.U. Linster and J. Volkert 
Institute of Computer Science (IMMD) 
University of Erlangen - Nuremberg 
Federal Republic of Germany 


ABSTRACT 


A new Erlangen multiprocessor system, EMSY85, will consist of a 
grid-like array of microprocessors operating asynchronously, each of 
which is coupled via memory with a limited number of its neighbors. 
We intend to demonstrate that for a broad spectrum of applications the 
system's performance can grow nearly in proportion to the number of 
processors in the array. The operating system is based on UNIX*. A 
programming environment for parallel programs makes the system attrac- 
tive to the users. Many design decisions have been based on the 
results of an existing pilot project. 


1. Introduction and Motivation


Numerous attempts have been made in
the past few years to increase computing
power by means of multiprocessor systems
with various architectures. Two such
systems have been implemented in
Erlangen, SYMPOS [12,15], which is a
symmetrical system, and EGPA** [5],
which is hierarchical.


The experience gained with applica-
tions on these two projects led scien-
tists at the Computer Science Department
(IMMD) of the University of Erlangen -
Nuremberg to conceive EMSY85, which will
be implemented in the next few years.
EMSY85 consists of a grid-like array of
microprocessors operating asynchro-
nously, each of which is coupled via
memory with a limited number of its
neighbors. Because of the large number
of processors, a symmetrical system
(each processor has access to each
memory module) is unrealistic [6]. On
the other hand, because of the many
computation-intensive user applications
with a matrix structure, a processor
field with a grid structure was chosen.


Above this array there is a hierarchy of
processors, whose job it is to supervise
the array and to transport data between
processors that happen not to be neigh-
bors. The higher-level processors can
also perform tasks other than supervi-
sion. Thus EMSY85 is also well suited
for tree-like user applications. In
addition, results from the pilot project
EGPA lead us to believe that many user
applications with a subtask structure
that is neither grid-like nor tree-like
can easily be mapped onto, and computed
efficiently on, EMSY85.


Therefore the project's main goal
is to show that for a broad spectrum of
applications, system performance can
indeed grow nearly in proportion to the
number of processors in the array. We
are thus planning to test a large number
of algorithms from such fields as phy-
sics, chemistry, operations research and
image-processing, many of which have
already shown large speedups on the
pilot project. In order to make the
system attractive to potential users,
the complexity of programs written for
the system must not be essentially
greater than that of programs written
for a monoprocessor. Not only must the
system's higher-level language contain
constructs for asynchronous programming,
there must be a programming environment
capable of supporting the development of
parallel software.


The paper describes five aspects of
the EMSY85 project: the hardware, the
operating system, the programming
environment, measurement and performance
aspects and applications.


* UNIX is a Trademark of Bell Laboratories.

** EGPA was supported by the Federal
Ministry for Research and Develop-
ment, F.R.Germany


2. The EMSY85 - Architecture 


The Erlangen multiprocessor system 
EMSY85 will consist of identical 
Processor—-Memory Modules (PMMs). Each 
PMM will consist, in turn, of an iAPX 
286/287 microprocessor and a one-half 
megabyte multiport memory. The PMMs are 
arranged hierarchically in four levels 
(A,B,C,D) as shown in Fig. 1.


Fig. 1: EMSY 85 - Hardware-Structure: Four hierarchical levels
A, B, C, D. Some of the elementary pyramids are highlighted.


At the topmost level there is only a
single PMM, for which there will be
several standby processors to increase
the system's reliability. At each lower
level, each PMM is connected to exactly
four neighbors at the same level, i.e.
each processor has access to the
memories of its neighboring PMMs. Thus
at each level the hardware has a grid
structure.


In addition to the horizontal
accesses, each PMM, except of course the
very lowest, has access to four PMMs at
the next lower level. The vertical con-
nections are equipped to broadcast data
downward to all four of the lower PMMs
simultaneously, say, to transmit code
segments. On the other hand, though the
lower PMMs do not have memory-access to
those at higher levels, they are able to
interrupt their supervising PMM. These
substructures, consisting of four pro-
cessors and their supervisor, are called
elementary pyramids, cf. Figure 2.
Several of them are highlighted in Fig-
ure 1. Each elementary pyramid has an
I/O processor with the corresponding I/O
devices, for the most part Winchester
disks. The I/O processor, which has
access to all of its elementary pyramid
memories, is controlled by the supervis-
ing PMM.


Fig. 2: EMSY 85 - Elementary pyramid


The overall arrangement of EMSY85's 
PMMs is, as the preceding discussion 
suggests, pyramidal. The topmost PMM is 
connected to a network containing 
several software-development processors 
on which the operating system and appli- 
cations are in development. 


An EMSY85 pyramid can, of course, 
be extended downward arbitrarily, by 
adding new levels. Such an extension 
increases significantly the computing 
power of the system. 


For many kinds of computations, the 
main computing load will be carried by 
the lowest level, leaving the higher 
levels underutilized. For such applica-
tions it would be reasonable to taper
the pyramid so that each lower level in 
an elementary pyramid has nine, or even 
sixteen, rather than merely four PMMs. 


Experience with the pilot project,
EGPA, which was built using five power-
ful miniprocessors, has shown however
that excessive tapering can lead to
bottlenecks at the higher levels that
restrict the system's overall perfor-
mance. EMSY85 will therefore have two
manually switchable configurations, one
more strongly tapered than the other:


Lowest level PMMs:       8x8               9x9
Element. pyramid PMMs:   4 + 1             9 + 1
Number of levels:        4                 3

Total PMMs:              64+16+4+1 = 85    81+9+1 = 91
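The totals for the two configurations follow from shrinking the grid side by a fixed factor at each level until a single PMM remains. The short Python check below reproduces that arithmetic; the function name and parameters are illustrative.

```python
def pyramid_level_sizes(lowest_side, shrink):
    """PMM count per level, from the lowest level up to the single top PMM.

    lowest_side: side length of the square grid at the lowest level
    shrink: factor by which the side length shrinks per level
            (2 for 4+1 elementary pyramids, 3 for 9+1)
    """
    sizes = []
    side = lowest_side
    while True:
        sizes.append(side * side)
        if side == 1:
            return sizes
        if side % shrink:
            raise ValueError("grid side must be a power of the shrink factor")
        side //= shrink
```

An 8x8 lowest level with 4+1 pyramids gives 64, 16, 4, 1 (85 PMMs on four levels), and a 9x9 lowest level with 9+1 pyramids gives 81, 9, 1 (91 PMMs on three levels).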


3. EMOS — the EMSY85 Operating System 


The operating system, which is
based on UNIX, will be structured in a
hierarchy analogous to the hardware.
The operating system consists of more or
less autonomous subsystems, one per pro- 
cessor. Each of the subsystems has a 
common kernel, but the subsystems on the 
lowest level are rudimentary and 
increase in power toward the top of the 
pyramid. 


The opinion, often found in the
literature, that UNIX is unsuited to
multiprocessor systems can no longer be
maintained without qualification. The
multiprocessor project "SYMPOS - An
Operating System for Homogeneous Mul-
tiprocessor Systems" has shown that UNIX
can be modified with relatively little
effort and essentially without changing
its structure. The effort for such a
modification depends on the complexity
and homogeneity of the hardware, the
desired user-friendliness, and the
planned spectrum of applications. In 
addition, the question is relevant 
whether the user should have the possi- 
bility of implementing genuinely con- 
current processes, or merely quasi- 
concurrent processes. Considering all 
of these parameters, we estimate a total 
effort of 4 to 16 man-years. 


In order to avoid the phenomenon of 
processor-thrashing in memory-coupled 
multiprocessor systems, we adopted and 
expanded on an idea that found limited 
application in the CMU multiprocessor 
projects [8]. The problem consists of
relieving the bottleneck involved in
common memory-access; the solution con-
sists of maximizing the amount of code
and data for local functions loaded into
local memory. The procedure leads to an
operating system that consists of a
number of local subsystems based on a
common kernel. This approach was imple-
mented so successfully in SYMPOS that it
will be adopted for EMSY85. Subsystems
of varying power for the different
hardware levels are easily provided
since the system is partitioned into
modules that can be freely combined to
form a complete system.


In spite of the fact that EMSY85 is
not a symmetrical system, similar prob-
lems arise since every memory is
equipped with seven ports, independently
of the size of the elementary pyramid.
In view of the number of accessing pro-
cessors, measures similar to those in
SYMPOS will certainly be necessary to
prevent bottlenecks.


Fig. 3: EMSY 85 - Process-Structure


UNIX insiders will note the congruence 
between the hardware structure and 
UNIX's process structure. It is thus 
fairly easy to extend the original
process-management to a distributed
system. We shall use the fork function
(spawn a process) as an example.


In the local environment, i.e. on a
single processor, fork functions as
usual. There is in addition a
distributed version, pfork, whose effect is
analogous to its local cousin. The
difference is merely that for a pfork
process a new hardware environment is
initialized. The two processors
involved must possess a common physical
memory space; in case more than one
generation level is involved the memory
spaces may be disjoint. The hardware to
which a pfork process is assigned can
be influenced by parameters such as
access to non-local data segments,
processor assignment, and so forth.


Figure 3 shows a typical process 
structure resulting from multiple pfork 
and fork operations. The process iden- 
tifiers consist of a triple identifying 
hardware level, processor number, and 
local process identifier. 
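
The identifier triple described above can be pictured as a small record.
The following Python sketch is only an illustration: the type name, the
field names, the level labels "A" and "B", and the sample numbers are all
assumptions made here, not part of the EMSY85 design.

```python
from typing import NamedTuple


class ProcessId(NamedTuple):
    """Hypothetical rendering of the (level, processor, local id) triple."""
    level: str       # hardware level (labels such as "A" or "B" are assumed)
    processor: int   # processor number within that level
    local_pid: int   # process identifier local to that processor


def same_processor(p: ProcessId, q: ProcessId) -> bool:
    """True if both processes live on the same processor of the same level."""
    return (p.level, p.processor) == (q.level, q.processor)


parent = ProcessId("B", 0, 17)
child_fork = ProcessId("B", 0, 18)   # fork: child stays on the parent's processor
child_pfork = ProcessId("A", 3, 1)   # pfork: child gets a new hardware environment
```

Under this reading, a fork changes only the third component, while a pfork
may change all three.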


In addition to the multiprocessor-
specific analogues of fork, wait, exit,
alarm, etc., there will be a number of
completely new functions for the
coordination of asynchronous processes and
management of global files. The new
coordination functions include
mechanisms such as semaphores, lock/unlock and
message switching.


In this section we have discussed
several aspects of process management in
the EMSY85 Operating System. Further
research areas relevant to the project
include resource management in
multiprocessor systems, online
reconfiguration after hardware failures and
adaptable management strategies, among
others.


4. A Programming Environment for Parallel
Programs


The operating-system interface in
the EMSY85 multiprocessor system permits
the implementation of a user's algorithm
as a system of cooperating concurrent
processes. In order to permit the use
of such a multiprocessor system by
non-specialists, tools are provided in a
parallel programming environment oriented
to the user's problem rather than to the
system's architecture.


Of course it would be ideal if the
task could be distributed automatically
to the various processors. The user
could then program his application as if
it were a sequential process. Current
experience with the EGPA system leaves
us skeptical about the success of
completely automatic transformation of
sequential into parallel algorithms.


The parallel programming environment
therefore assumes the following
model: an application consists of
sequential subtasks and a description of
their interdependencies, either
concurrent or sequential. The individual
subtasks are formulated in a powerful
higher-level programming language (C,
because of the use of UNIX). There is a
programming package that permits easy
formulation of the required
synchronization by means of such calls as


"execute subtask x on processor y" 


"wait for the termination of subtask x
on processor y" 


The programming package implicitly
includes the generation of the requisite
system of processes, handles the
individual calls necessary for process
communication (e.g. using messages), and
initiates the subtasks on their respective
processors. This approach has been
successfully tested in the pilot project
EGPA.
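
The two calls quoted above can be imitated in a few lines of Python. This
is a minimal sketch under stated assumptions: the class name
SubtaskPackage and its methods are hypothetical, threads stand in for
processors, and nothing here reproduces the actual C package.

```python
from concurrent.futures import ThreadPoolExecutor


class SubtaskPackage:
    """Hypothetical stand-in: one single-threaded worker per 'processor'."""

    def __init__(self, n_processors):
        self._workers = [ThreadPoolExecutor(max_workers=1)
                         for _ in range(n_processors)]
        self._running = {}  # (subtask name, processor) -> Future

    def execute(self, name, processor, fn, *args):
        """'execute subtask x on processor y'"""
        future = self._workers[processor].submit(fn, *args)
        self._running[(name, processor)] = future

    def wait(self, name, processor):
        """'wait for the termination of subtask x on processor y'"""
        return self._running.pop((name, processor)).result()

    def shutdown(self):
        for worker in self._workers:
            worker.shutdown()


def partial_sum(lo, hi):
    """A sequential subtask: sum the integers lo .. hi-1."""
    return sum(range(lo, hi))


pkg = SubtaskPackage(n_processors=4)
for p in range(4):
    pkg.execute("sum", p, partial_sum, p * 250, (p + 1) * 250)
total = sum(pkg.wait("sum", p) for p in range(4))
pkg.shutdown()
```

The user writes only sequential subtasks and the two kinds of call; the
creation of workers and the hand-over of results stay hidden in the
package, which is the division of labour the text describes.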


Another approach, which is much
easier to use and has proved
efficient as well, has been borrowed from
data-flow theory. The theory that
applies to elementary operations such as
"+" or "*" yields an elegant
generalization to subtasks, viewed as complex
operations, which we call macro
data-flow. The synchronization dependencies
between subtasks can be expressed in
analogous fashion. In the EGPA pilot
project we have shown that these tools
are especially attractive for
applications whose asynchronous structure is
extremely complex [10].

5. Measurement and Performance Evaluation

The measurement and evaluation
subproject of EMSY85 is intended to support
the users in optimizing performance,
whether they be implementing the
operating system, the programming environment,
or applications. Since the simultaneous
operation of nearly one hundred
processors is beyond human comprehension
without some kind of visual support,
there is a software-measurement system
applicable to all levels of the system,
whose measurement points can be selected
arbitrarily. A trace of the selected
events is recorded along with timing
information.


Then a direct evaluation of this
information is made possible by an
on-line visual-display package. Not only
can the system's current status be
displayed; the trace data can also be used to
display the flow of events in
slow-motion.


Since the measurement points are
chosen using the programming
environment, the data can be displayed in the
user's notation, i.e. with his symbolic
names, rather than as hexadecimal
numbers.


It would, of course, involve exces- 
sive implementation effort to attempt to 
find an "optimal" version among several 
candidate implementations by measuring 
their performance. However, a modeling
and evaluation procedure for complex 
tasks based on stochastic analysis of 
the measurement data permits a choice 
among interesting variants [9]. The 
task model used to implement this sub- 
system is based, as are the 
environment's other programming-support 
tools, on the data-flow approach. 


A survey of research on hardware 
and software measurement as well as per- 
formance evaluation in the EGPA pilot 
project, which constituted the planning
basis for the measurement and evaluation 
software in EMSY85, can be found in [3]. 


6. Applications 


EMSY85 will be capable of
implementing a broad spectrum of
applications. Among them will be tasks that
require intensive computation, e.g.
problems from physics, operations
research and pattern-recognition.

Such problems can often be reduced, 
either directly or via discretization, 
to problems in linear algebra. An 
important research area is the discovery 
of appropriate asynchronous algorithms, 
especially for the solution of large 
systems of linear equations with either 
sparse matrices (10**6 unknowns), e.g. 
finite elements or systems of differen- 
tial equations, or dense (10**3 to 10**4 
unknowns). These problems are not dif- 
ficult to implement on EMSY85 because of 
the close match between the structure of 
the tasks and the system's array. For 
other classes of tasks, the match is not
as felicitous. For such problems,
strategies must be developed for mapping the
problem onto EMSY85's hardware
structure. Not all application algorithms
can be adapted without modification;
some must in fact be re-developed
from scratch. Such problematic tasks,
e.g. from pattern-recognition,
non-linear programming (say for technical
installations), simulation of
complicated systems (telephone networks,
multiprocessors, VLSI circuits), and so
forth, are also to be investigated.
These will require research in the
decomposition of tasks, the adapting of
asynchronous algorithms, and the design
and implementation of parallel programs.
The computing demands of several of
these tasks, especially those from
pattern-recognition, comprise both high
computation speed and high data-transfer
rates.


By means of these applications, we 
intend to show that, for problems of 
sufficient size, the system's perfor- 
mance improves nearly linearly in the 
number of array-processors (level A). 
Considering the breadth of the
application areas, EMSY85 will thus have
proven itself to be a multi-purpose system.


As a result of experience with the 
pilot-project EGPA (Erlangen General- 
Purpose Array), we are convinced that we 
shall indeed be able to demonstrate the 
expected improvement. EGPA's hardware 
structure corresponds to a single ele- 
mentary pyramid in EMSY85, consisting of 
four array PMMs and one supervisory PMM. 
Thus the limiting speed-up for an
algorithm run on EGPA (versus a
monoprocessor) is four-to-one.


We list below a number of applica- 
tions actually tested on EGPA along with 
their speed-up factors. 


Subject:                          Speed-up:

Linear algebra [7]:
 - Matrix inversion
   (200 x 200 dense)
     Gauss-Jordan                 53
     column-substitution          ca. 4.0
 - Matrix multiplication          Dat
   (200 x 200)
 - Solving of linear equations
     Gauss-Seidel                 ca. 4.0
Differential equations [2]:
 - Relaxation                     C8 4.5.5


Image processing and graphics: 

-~Topographical representation [10] 

-~Illumination of the topo- 
graphical model Bek 

-Line following . 
(vectorizing of a grey-level 
matrix) 

-Distance transformation 


[4] 


\N 
On 


Cas: 360 = 325 


Non-linear programming [1]:
-Search for minima of a multi- 
dimensional object function 
given by an algebraic term ca. 3.2 

Graph theory: 

~network flow with neighborhood ike) 
support (each idle processor 
helps one of its neighbors) 

Text formatting [14]: 2.6


These encouraging results, which span a
broad spectrum of applications, were an
essential factor in the design of
EMSY85. On the basis of theoretical
investigations, speedups proportional to
the number of processors at the lowest
level can be expected on systems of the
same type.


7. Conclusions 


The University's Institute of Com- 
puter Science is working hand in hand 
with an industrial partner, Siemens’ 
Corporate Laboratories for Information 
Technology in Munich. The cooperation 
is in the design, development and pro- 
duction of the hardware, as well as in 
the development and testing of parallel 
algorithms for computation-intensive
applications such as the simulation of 
complex systems. 


The EMSY85 project employs about 
fifty scientists at the University from 
the following academic chairs in the 
Computer Science Department: Computer 
Architecture, Performance Evaluation, 
Operating Systems, Programming Languages 
and Pattern Recognition. The German 
Federal Ministry for Research and
Development is supporting EMSY85.



Maximum Pipelining of Array Operations on Static Data Flow Machine

Gao Guang Rong

Cambridge, MA 02139


Abstract

Data flow computers are a radical
departure from conventional computer
architecture, and new methodologies are
required for generating efficient
machine-level programs from high-level user
programming languages. In this paper, we show
that, for certain programs in the Val
language, it is possible to construct
machine-level data flow programs that support
fully pipelined computation. A Val program in
the class considered consists of blocks of
code each of which defines a new array value
either by a forall expression in which each
element may be computed independently, or by a
for-iter expression that defines array elements
by a first-order recurrence relation.


* This research was supported by the
Department of Energy under grant number
DE-AC02-79ER10473 and the National Science
Foundation under grant number MCS-7915255.

0190-3918/83/0000/0331$01.00 © 1983 IEEE


1. Introduction


In this paper we study the translation of
the program structures used to express array
computations in the programming language Val
[1], a functional programming language
designed for expressing computations to be
executed by computers capable of highly
concurrent operation, data flow computers in
particular.


The organization of a data flow computer
that appears most attractive to us for high
performance computation is the static data
flow supercomputer described in [2] [3]. A
machine level program for such a computer,
regarded as a collection of instruction cells,
is essentially a directed graph, with nodes
corresponding to instructions and an arc for
each instruction destination field. We will
use such diagrams to present data flow machine
code structures in the remainder of this
paper.


Two constructs in the Val programming
language are of major importance in expressing
scientific computations. A forall expression
can be used to express the construction of an
array where each element of the array is
specified by the same computational rule and
all elements may be computed independently.
Example 1 is a Val forall expression which uses
values from two arrays B and C to construct a
new array A.


A : array[real] :=
  forall i in [0, m+1]                     % range specification
    P : real :=                            % definition
      if (i = 0)|(i = m+1) then C[i]
      else 0.25*(C[i-1]+2.*C[i]+C[i+1])
      endif;
  construct B[i] * (P * P)                 % accumulation
  endall

Example 1. A Val forall Construct


The for-iter expression in Val is the
construct used to express iteration, that is, the
computation of sequences of values in which
the value produced in one cycle depends on the
value or partial results produced by the
preceding cycle. Example 2 is a for-iter loop
which constructs an array X.


X : array[real] :=
  for i : integer := 1;                    % initialization
      T : array[real] := [0: 0.]
  do
    let
      P : real := A[i]*T[i-1]+B[i]         % definition
    in if i < m then                       % body
         iter T := T[i : P]
              i := i+1
         enditer
       else T
       endif
    endlet
  endfor

Example 2. A Val for-iter Construct


The Val programs of interest in this
paper are those made up of program blocks,
each of which is a forall or a for-iter block.
Each block may be thought of as a producer of an
array value, and a consumer of other array
values produced by other blocks. This simple
structure matches the main body of many
practical programs of computational physics.


Definition A pipe-structured program is a Val
program in which all array constructions are
defined by non-nested blocks such that: (1)
each block is either a forall block or a for-iter
block, (2) the index ranges of the arrays
generated by the blocks are fixed.


Pipe-structured programs are attractive 
candidates for implementation as fully 
pipelined machine code structures for data 
flow computers. 
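
For readers unfamiliar with Val, the two block types of Examples 1 and 2
can be paraphrased in Python. This is only a reading aid: the function
names are invented here, Python lists stand in for Val arrays, and the
second function follows the loop as printed, which stops appending once i
reaches m.

```python
def forall_block(B, C, m):
    """Example 1: each A[i] depends only on B and C, never on another A[j]."""
    A = []
    for i in range(0, m + 2):                    # index range [0, m+1]
        if i == 0 or i == m + 1:
            P = C[i]
        else:
            P = 0.25 * (C[i - 1] + 2.0 * C[i] + C[i + 1])
        A.append(B[i] * (P * P))
    return A


def for_iter_block(A, B, m):
    """Example 2: T[i] = A[i]*T[i-1] + B[i], starting from T[0] = 0."""
    T = {0: 0.0}
    i = 1
    while True:
        P = A[i] * T[i - 1] + B[i]
        if i < m:
            T[i] = P                             # iter T := T[i : P]
            i += 1
        else:
            return [T[k] for k in range(i)]      # the array built so far
```

The first function could compute its elements in any order, or all at
once; the second cannot, because each element needs its predecessor. That
difference is what separates the two pipelining schemes developed in the
following sections.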


2. Pipelined Mapping of Primitive Expressions 


Pipelined execution of computations is 
very natural on the static data flow computer. 
We first study the pipelined implementation of 
a restricted class of Val expressions, which
contains no nested forall or for-iter expressions 
and no array constructor operations. 


Definition Let i be an identifier called an 
index variable. Then a primitive expression (PE) on 
i is any Val expression which may be 
constructed using only the following rules: 

(1) A scalar literal constant is a PE. 

(2) An identifier of a scalar value is a 
PE. 

(3) If E1 and E2 are PEs, then (E1 op E2)
is a PE, where op is an arithmetic or
relational operator.

(4) If A is an identifier that denotes an
array, then A[i+m] is a PE, where m is
an integer constant.

(5) Let E be a Val let-in construct
expressed as let <definition> in E0 endlet.
If the definition part contains only
PEs and E0 is also a PE, then E is a
PE.

(6) If E1, E2, E3 are PEs, then if E1 then
E2 else E3 endif is a PE.


If a primitive expression is formed using
only rules (1), (2), (3), and (5), its
implementation as an acyclic data flow
instruction graph is straightforward, and the
methods developed by Montz [6] may be used to
balance the instruction graph so that it
supports fully pipelined computation. For the
array access operations (rule (4)), two matters
must be addressed to make pipelined operation
work correctly: (1) the elements of the
incoming array not used in the computation
must be discarded so they do not cause jams;
(2) buffering must be inserted to introduce
any skew needed to balance the pipeline. As
an example, consider the expression 0.25 * (
C[i-1] + 2. * C[i] + C[i+1] ) from the body of
Example 1 in Section 1. The corresponding
fully pipelined instruction graph is shown in
Figure 1. Here we suppose the array C is
represented by m+2 result packets for the
index set {0,..., m+1}. The boolean control
sequences select just those array elements
needed for the computation. The two FIFOs
balance the pipeline by holding values of
array elements between their arrival at the
identity instructions and the time when they
must enter the arithmetic pipeline.

Fig. 1. Pipelining for Array Selection Operations
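
The effect of the buffering described above can be imitated in software.
The sketch below is an analogy, not the instruction graph of Figure 1: it
assumes the elements of C arrive one per step as a Python iterable, and a
three-element window plays the role of the skew buffers. The function
name is invented for this illustration.

```python
from collections import deque


def pipelined_stencil(c_stream):
    """Consume C[0], ..., C[m+1] in order, one element per step, and emit
    0.25*(C[i-1] + 2*C[i] + C[i+1]) for i = 1, ..., m.

    The bounded deque holds earlier elements until their partners arrive,
    which is the job the FIFOs do in the pipelined graph.
    """
    window = deque(maxlen=3)
    for value in c_stream:
        window.append(value)
        if len(window) == 3:          # no output until the skew is filled
            left, mid, right = window
            yield 0.25 * (left + 2.0 * mid + right)


smoothed = list(pipelined_stencil([1, 2, 3, 4]))
```

After the first two arrivals, one result leaves for every element that
enters, which is the steady-state behaviour a balanced pipeline is meant
to have.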


The final case is that of conditional
expressions. The general technique is
illustrated in Figure 2 for the following
example:


if C[i] then (A[i]+B[i])
else 5.*(A[i]*B[i]+2.)
endif


Fig. 2. Pipelining for an if-then-else expression 


This instruction graph makes use of 
instruction cells (identity operations in this 
case) in which a boolean operand directs a 
result packet to destinations according to a 
tag (T or F) on the destination arc. The 
control input M directs the merge instruction 
to forward one or the other of its data 
operands. Note that to keep the program fully 
pipelined, it may be necessary to add FIFO 
buffers to both the data and the control arms. 
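
The routing and merging just described can be sketched with ordinary
Python lists standing in for packet streams. The function names switch,
merge and conditional are assumptions of this illustration; the sketch
shows only the order in which packets are routed and recombined, not the
timing or the FIFO buffers of the real graph.

```python
def switch(data, control):
    """Send each data packet to the T or the F output, as its control says."""
    t_out, f_out = [], []
    for packet, flag in zip(data, control):
        (t_out if flag else f_out).append(packet)
    return t_out, f_out


def merge(control, t_in, f_in):
    """Forward one packet from t_in or f_in per control value, in order."""
    t_iter, f_iter = iter(t_in), iter(f_in)
    return [next(t_iter) if flag else next(f_iter) for flag in control]


def conditional(A, B, C):
    """if C[i] then (A[i]+B[i]) else 5.*(A[i]*B[i]+2.) endif, for every i."""
    t_pairs, f_pairs = switch(list(zip(A, B)), C)
    t_results = [a + b for a, b in t_pairs]                 # 'then' arm
    f_results = [5.0 * (a * b + 2.0) for a, b in f_pairs]   # 'else' arm
    return merge(C, t_results, f_results)


out = conditional([1, 2, 3], [4, 5, 6], [True, False, True])
```

Each arm sees only the packets routed to it, and the same control
sequence that steered the switch later steers the merge, so results come
out in index order.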


This discussion and these examples lead us to
the following theorem, which provides a basis
for the constructions presented in the next
two sections.


Theorem 1 For any primitive expression, a
fully pipelined data flow instruction graph
can be constructed.

3. Pipelined Mapping of forall Constructs


In this section we will present the pipeline
scheme where the array elements are generated
in sequence by implementing the body of the
forall construct as a pipelined instruction
graph.


Definition A primitive forall expression is a
forall expression in which: (1) The index range
is specified as [p,q] where p and q are
integer constants. (2) The right hand side of
the definitions and the expression in the
accumulation part are all primitive
expressions in i, where i is the index
variable of the forall expression.


Fig. 3. Pipelining of a Primitive forall Expression


The fully pipelined implementation of a
primitive forall expression (Example 1 in
Section 1) is shown in Figure 3. It is
essentially the instruction graph obtained by
cascading the instruction graphs for the
definition expression and the accumulation
expression. We suppose the input arrays B and
C are fed to the instruction graph element by
element for the index set {0,..., m+1}. The
identity instructions select from the input
arrays those elements needed for the
computation, and the merge instruction
combines results computed by different rules
into the sequence of values that represent the
constructed array. Further details of this
implementation scheme can be found in [4]. As
a result we have


Theorem 2 For any primitive forall expression, a 
corresponding fully pipelined data flow 
instruction graph can be constructed. 


4. Pipelined Mapping of for-iter Constructs

To study the pipelined implementation of
iterative programs we first define a class of
for-iter constructs which are built on primitive
expressions.


Definition A primitive for-iter construct is a for-iter
expression with two loop variables (let them
be i and X) such that: (1) Loop variable i
takes on successive integer values p, p+1,...,
q for successive evaluations of the for-iter
body, and the loop terminates after i = q.
(2) The loop variable X is initialized to the
empty array or by X := [ r: E ] for some
integer r and some primitive scalar expression
E. Each iteration appends to the array by X :=
X [ i: F ]. (3) The result expression on loop
termination is X, which will be the array
constructed by the for-iter expression.


One scheme for implementing the for-iter
construct is to introduce feedback in data
flow instruction graphs [7]. Due to the
presence of cycles, the instruction graph
corresponding to such a scheme cannot, in
general, be fully pipelined.


The most common problems involving for-iter
array operations are recurrences. Example 2
shows the general form of a first order
recurrence function expressed in Val. In
fact, it is exactly the Val code for the
following mathematical notation,

    x_i = A_i x_{i-1} + B_i
        = F(a_i, x_{i-1})                          (2)

where a_i is the ordered pair (A_i, B_i), and the
function F is composed of add and multiply.
Based on a solution first proposed in [5], we
note that (2) can easily be transformed into

    x_i = F(c_i, x_{i-3})

where the pair c_i is computed from the a's by

    c_i^{(1)} = A_i A_{i-1} A_{i-2}
    c_i^{(2)} = A_i A_{i-1} B_{i-2} + A_i B_{i-1} + B_i


Fig. 4. Pipelining of a Simple for-iter Expression
(the dashed box is the companion pipeline)


This transformation is useful to us
because x_i now depends on x_{i-3} instead of x_{i-1},
and we note that the function F has an
execution delay of 3. Therefore we can
compute F by using an auxiliary pipeline that
computes c_i from a_i using the scheme shown in
Figure 4. This added pipeline (see the
dashed-line block in Figure 4) will be named
the companion pipeline in the rest of this paper.
Also the function computed by the companion
pipeline is named the companion function of the
recurrence function. By constructing the
companion pipeline properly, it is possible to
keep the whole pipeline running at maximum
throughput.
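
The algebra behind the transformation can be checked numerically. The
sketch below assumes plain Python lists indexed from 0 for the
coefficients A and B (index 0 is unused) and invented function names; it
compares the direct recurrence with the version that reaches back three
steps through the companion function.

```python
def direct(A, B, x0, n):
    """x_i = A_i * x_{i-1} + B_i for i = 1..n."""
    x = [x0]
    for i in range(1, n + 1):
        x.append(A[i] * x[i - 1] + B[i])
    return x


def companion(A, B, i):
    """The pair c_i = (c1, c2) such that x_i = c1 * x_{i-3} + c2."""
    c1 = A[i] * A[i - 1] * A[i - 2]
    c2 = A[i] * A[i - 1] * B[i - 2] + A[i] * B[i - 1] + B[i]
    return c1, c2


def via_companion(A, B, first_three, n):
    """Same sequence, but each x_i is computed from x_{i-3} only."""
    x = list(first_three)            # x_0, x_1, x_2 must be supplied
    for i in range(3, n + 1):
        c1, c2 = companion(A, B, i)  # depends on the coefficients alone
        x.append(c1 * x[i - 3] + c2)
    return x


A = [0, 2, 3, 1, 2, 3, 1]
B = [0, 1, 1, 2, 1, 0, 3]
reference = direct(A, B, 1, 6)
transformed = via_companion(A, B, reference[:3], 6)
```

Because companion() uses only the coefficients, all c_i can be produced
ahead of time by a separate pipeline, and three consecutive x values can
be in flight at once.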


The previous related work on the use of
companion functions [5] has been on
conventional architecture, where the pipeline
configuration is wired into hardware. It is
impractical, however, to construct a separate
hardware companion pipeline for each possible
recurrence relation in the computation. In
contrast, the pipeline for a data flow machine 
is software implemented. It is more flexible 
to introduce a piece of data flow program 
which acts as a particular companion pipeline. 
Hence, it is much more attractive to apply 
this scheme on a data flow machine. Now let 
us return to the problem of classifying Val 
for-iter constructs which have good mapping 
schemes. 


Definition A simple for-iter expression is a 
primitive for-iter expression such that (1) the 
recurrence function it denotes has a companion 
function and (2) the Val expression which 
computes the companion function is a PE. 


Using the scheme presented above, we have
the following theorem:


Theorem 3 A simple for-iter expression can be 
mapped into a fully pipelined instruction 
graph. 


Some other techniques for pipelined 
implementation of iteration expressions are 
known, generally involving trading off delay 
in exchange for achievement of computation at
the maximum rate. 


5. Fully Pipelined Pipe-Structured Programs 


A pipe-structured program in which each 
forall expression is primitive and each for-iter 
expression is simple has an elegant structure; 
each component is a consumer and producer of 
array values and has an implementation as a 
fully pipelined data flow instruction graph. 


Due to the applicative nature of the Val
programming language, the data dependencies
among the forall and for-iter expressions define an
acyclic directed graph in which each edge 
represents a path over which an array value is 
sent from producer to consumer. Since the 
component instruction graphs are fully 
pipelined, the balancing algorithm [4] may be 
applied to the acyclic interconnection to 
produce a fully pipelined instruction graph 
for the complete pipe-structured program. 


Theorem 4 For any pipe-structured program in 
which each forall expression is primitive and 
each for-iter expression is simple, a fully 
pipelined data flow instruction graph can be 
constructed. 


6. Conclusion 


We have developed a formal model of 
pipe-structured programs for use in studying 
algorithms for balancing and optimizing 
corresponding data flow instruction graphs for 
fully pipelined operation. Interested readers 
will find a rigorous formulation and analysis 
in [4]. Investigation of the design of a
compiler that will automatically construct
fully pipelined code for a large class of Val
programs is a subject for further study.



ABSTRACT


A new approach to the utilization of VLSI processing 
arrays by means of the algorithms running on them is 
presented. The idea is to represent algorithms as data 
flow graphs, and then map these graphs onto the array.
This approach obviates the need to develop new con- 
current algorithms to utilize the parallelism inherent in 
the array, while offering a general environment for the 
realization of algorithms on semi-custom VLSI. 


1. INTRODUCTION 


The approach taken in this research to achieve paral- 
lelism within a special purpose VLSI chip, without develop- 
ing new concurrent algorithms, is the data flow approach 
[3-5]. In it, concurrency of activities is achieved at the 
lowest possible level by treating each machine instruction 
as an independent activity. This enables "fine grain paral- 
lelism" [3], not achievable when scheduling and synchron- 
ization of concurrent activities are controlled by software. 


However, we do not propose to use one of the known 
general-purpose architectures of data flow machines [3-5].
Instead, we suggest mapping the data flow graph that
describes the problem at hand onto a regular array
implemented in VLSI. These regular arrays of identical cells
take considerably less time to design and manufacture 
[1,2]. Also, the mapping should not be fixed but change- 
able, enabling the user to map various data flow graphs 
(algorithms) on the same chip. Regularity and flexibility 
are thus obtained, increasing the number of potential 
applications for the chip and thereby making it more 
appealing to the semiconductor industry. 


In the following we consider the hexagonal array as a 
basis for illustrating our approach. This array has a flexi- 
ble structure [1,6], simplifying the task of mapping. In 
addition, fault-tolerance may be introduced into it [6,7] 
allowing it to recover from errors by reconfiguration. We 
then propose an architecture for the processing element 
(PE) which constitutes the basic cell in the array. Also 
presented is an outline of the general graph-to-array 
mapping process. 


2. PRELIMINARIES 


In contrast to control flow computers, data flow computers
have no program counter. In a data flow computer, an
instruction is ready for execution when all its operands have
arrived. Consequently, all such instructions may be executed
in parallel. If the processing capabilities of the data
flow computer are sufficient, the highest degree of
parallelism may be achieved.


The program is represented by a data flow graph. The 
vertices correspond to operators, and data tokens move 
along the arcs. Parts of the graph may have to be exe- 
cuted iteratively. This might cause tokens to accumulate 
on certain arcs and result in the presence of tokens 
belonging to different iteration steps at the input arcs of 
an operator. This problem may be solved by either label- 
ing (coloring) the tokens [4] or by preventing the accu- 
mulation altogether [3]. The latter is achieved by prevent- 
ing an operator from producing a new output token until 
the previous one has been consumed [3,8]. This approach 
still enables pipelining through the data flow graph.


Maximum pipelining is not, however, always possible.
Bottlenecks may appear in parallel segments of the graph
[8] (e.g., paths of a conditional expression), thus severely
limiting the amount of concurrency. To eliminate these 
bottlenecks an optimization technique has been sug- 
gested in [8], inserting buffers (delay operators) in some 
of the parallel paths. However, these buffers may result in 
either an increase in the overall delay through the pipe- 
line or a reduction in the throughput [8]. 


In the architecture suggested here dynamic length 
FIFO queues are employed. In this way, the level of con- 
currency is increased without the penalty of an increase 
in the overall delay. The labeling scheme as presented in 
[4] might be inappropriate for our purposes due to the 
additional hardware complexity. 


3. PE ARCHITECTURE AND PRINCIPLES OF OPERATION 


The basic PE, shown in Figure 1, is connected to its 
six immediate neighbors by dedicated busses, in a
hexagonal processor array. The PE contains six registers
z_0, z_1, ..., z_5 which are connected to the six
communication busses. Each of these registers can either receive
or transmit tokens and will accordingly be defined either 
as a primary input register or a primary output register. 


Each basic PE contains, in addition, an arithmetic and
logic unit (ALU) and a Pseudo Associative Memory unit
(PAM). Each location within the PAM contains a key and a
data element. The PAM therefore has two input registers,
k_i (key-in) and d_i (data-in), and two output registers, k_o
and d_o (Figure 1).

The ALU is capable of performing the basic
arithmetic and logic operations. Its inputs may be connected
to any primary input register, or to the PAM data-out
register d_o. The ALU result is routed to either a primary
output register or the PAM data-in register d_i.


The overall function of the PE is specified by the 
designation of each of the z_i registers as primary input or
output register, by the internal connections of these 
registers to the ALU and the PAM unit registers, and by 
the operation performed by the ALU. Thus, the operation 
of a PE may be defined by a set of statements like 


dj:=27% +2, Zeid 
PE : Zgi=2, @s5:= Xo 
The PAM unit has two modes of operation, First-in 
First-out (FIFO) mode and Associative mode. In the FIFO 
mode the PAM unit serves as an input or output buffer for 
token accumulation. In the associative mode the PAM 
serves as a queue in which a key is attached to each data 
element. This mode of operation is useful when imple- 
menting recursion and in it data elements are accessed 
by using keys instead of addresses. However, a fully asso- 
ciative memory is not necessary. Instead, a sequential 
access memory unit with added logic can be employed, 
using shift registers, CCD’s, or magnetic bubbles. 


In the FIFO mode of operation the PAM unit may 
serve as a queue for accumulating either incoming or out- 
going tokens. The purpose of this FIFO queue is to dynami- 
cally equalize the length of parallel paths in the graph in 
order to achieve maximum pipelining. The fixed capacity
of the PAM might limit the maximum length of the FIFO
queue. However, the PAM units in two or more neighboring
PEs may be chained and used for accumulating tokens
from a single source.


For the sake of brevity, the exact principles of opera- 
tion of the proposed PAM unit are not detailed here. 


4. IMPLEMENTING BASIC DATA-FLOW STRUCTURES 


This section shows how to use an array of PEs in the
implementation of data-flow structures. We begin by
examining the basic data-flow elements, which can be
directly mapped onto a single PE. These are the
Arithmetic and Logical Operators (like addition, negation, And,
complement, etc), and the Conditional Operators (like
arithmetic comparisons, test for zero, etc). Also
requiring only a single PE are the Flow Control Operators. They
do not affect the contents of the token, but rather its
progress and/or destination.


The simplest such operator is Copy (denoted by C). It 
makes two identical copies of its input token. 

The Merge (M) operator is used when a data token may come
from two different sources (paths), and it is to be merged 
into a single path for further processing. This operator is 
also capable of producing a Boolean token whose value 
depends on the input path which supplied the token for 
the output. 

The Router (R) operator receives two inputs, a data token 
and a logical value. The data token is copied into exactly 
one of two output registers, depending on the value of the 
logical input. This is analogous to "distribute" in [9]. 

The Gate (G) operator transfers the incoming token to an 
output register if a second token is present at another 
input register. The G and M operators may be used to 
implement the "Select" operation [9]. 

Self-Iterating Operator (L) is used in those cases where 
the result produced by the PE is immediately used as 
argument to the next operation. This saves the need to 
create "external" loop structures (e.g., [10]). 
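The flow control operators above can be sketched in a few lines of Python. This is only an illustrative model, not the PE hardware: the function names are hypothetical, `None` is assumed to stand for "no token present", and the rule that Merge prefers its left input when both are present is an assumption the text does not state.

```python
from typing import Any, Optional, Tuple


def copy_op(token: Any) -> Tuple[Any, Any]:
    """Copy (C): emit two identical copies of the input token."""
    return token, token


def merge_op(left: Optional[Any], right: Optional[Any]) -> Tuple[Any, bool]:
    """Merge (M): forward the token from whichever path supplied it.

    The Boolean result tells which path was used (True = left path).
    Preferring the left path when both hold a token is an assumption.
    """
    if left is not None:
        return left, True
    if right is not None:
        return right, False
    raise ValueError("merge needs a token on at least one input")


def router_op(token: Any, control: bool) -> Tuple[Optional[Any], Optional[Any]]:
    """Router (R): place the token on exactly one output (T output, F output)."""
    return (token, None) if control else (None, token)


def gate_op(token: Any, enable: Optional[Any]) -> Optional[Any]:
    """Gate (G): pass the token only if a second token is present.

    The value of the enabling token is ignored; only its presence matters.
    """
    return token if enable is not None else None
```

For example, `router_op(5, True)` places the token on the T output only, and `gate_op(4, None)` holds the token back because no enabling token has arrived.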


Figure 2 depicts a data flow graph which calculates the factorial function, using C, M, R and L operators. Notice the labeling of the outputs of the R operator by T and F. This is used to specify which output path corresponds to each logical value. The L operator receives two inputs, one being the initial value for the iteration (in this case n), and the other a Boolean value which determines, for each iteration, whether to load a new initial value, or use the one from the latest iteration.


Having defined the basic data-flow elements, we now show how these may be combined to yield the basic data-flow structures.


4.1. Conditional (if-then-else) - This construct has the general format:

    if <condition> then <expression1>
    else <expression2>
    endif.

and is evaluated as <expression1> or <expression2>, depending on the logical value of <condition>. The above statement may be implemented in general as shown in Figure 3.

Notice that when a certain branch of the conditional 
is taken, the tokens corresponding to the other branch 
are not produced at all. This is achieved by using an R 
operator with only one output; this way tokens 
corresponding to different computations are not mixed. 


We also deal here with the problem of keeping in correct sequence the results being produced by a conditional construct. The ordering is achieved by using an extra R and two Gs as shown in Figure 3. The initial token present at the R input is routed to the G in the appropriate branch of the conditional, thus allowing only its result to flow through. When a result arrives at the M, a token (its value is immaterial) is recirculated to the R to enable further output tokens.


Note that conditional constructs may result in token 
accumulation, because of different path lengths between 
the two branches. Here we can benefit from the PE's 
capability of dynamically adjusting to token traffic, by 
using the FIFO mode of the PAM. 


4.2. Iterative (Do-While, Repeat-Until) - Iterative data- 
flow constructs make use of conditionals much in the 
same way traditional programming languages do. In gen- 
eral, we have the two iterative constructs, do-while and 
repeat-until, depending on whether the test for loop 
repetition is placed before or after the loop body, respec- 
tively. Figure 4 shows how a repeat-until loop is used to 
approximate the square root of a (positive) value c, using 
Newton's iterative method. 
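The repeat-until structure can be illustrated with a short Python sketch. It uses the standard Newton update for the square root, x_{n+1} = (x_n + c/x_n)/2, which is assumed here to be the iteration of Figure 4; the function name, starting value and tolerance are illustrative choices rather than values from the paper.

```python
def newton_sqrt(c: float, eps: float = 1e-9, x0: float = 1.0) -> float:
    """Approximate sqrt(c) for c > 0 with a repeat-until loop."""
    if c <= 0:
        raise ValueError("c must be positive")
    x = x0
    while True:                      # repeat
        x_next = 0.5 * (x + c / x)   # loop body: one Newton step
        if abs(x_next - x) < eps:    # until |x_{n+1} - x_n| < eps
            return x_next
        x = x_next
```

Because the test follows the loop body, the body always executes at least once, which is what distinguishes repeat-until from do-while.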


4.3. Recursion - This construct is by far the most 
involved in the data-flow context. Actually, most current 
data-flow architectures do not handle recursion at all. 
However, recursion is generally recognized as a good pro- 
gramming technique. When used, it leads in many cases to 
simpler and shorter algorithms which are easier to under- 
stand and to prove correct. 


For the sake of brevity we do not present here the way recursion is implemented. We would like, however, to indicate that any possible implementation of the recursion construct will be substantially more complex than all previous ones. The benefits of its use should therefore be carefully examined before incorporating it.


5. DATA FLOW GRAPH MAPPING ALGORITHM 


In the following, we show a simple (by no means 
optimal) scheme for mapping complete data flow graphs 
onto a hexagonally connected PE array. The mapping
algorithm presented is executed externally by some host, 
and the results are then fed into the array for distribu- 
tion. The graph mapping process is clearly dependent on 
the array topology. Therefore, different such topologies 
result in different mapping algorithms. Nevertheless, they 
all must tackle the same problem, namely, the non- 
planarity of the graph, arising from both ordering of 
operands and iterative constructs (loops). 


We begin by assigning levels to the vertices (operators), where an operator (mapped on a PE) is at level i if all its operands come from operators at level i-1 or above. Clearly, our objective is to find minimal levels for all operators. In the case of loops, we do not consider the target of the loop to be a descendant of the source.
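The level assignment can be sketched as a longest-path layering of the graph after the loop (back) edges are set aside. The Python below is a minimal illustration under that reading of the text; the function name, the edge-list representation and the convention that source operators sit at level 0 are assumptions.

```python
from collections import defaultdict, deque


def assign_levels(nodes, edges, loop_edges=()):
    """Give each operator the minimal level above all of its operands.

    nodes: list of operator names; edges: (source, target) pairs;
    loop_edges: the edges that close a loop, which are ignored so that
    a loop target is not treated as a descendant of the loop source.
    """
    back = set(loop_edges)
    succs = defaultdict(list)
    pending = {n: 0 for n in nodes}          # unprocessed operand edges
    for src, dst in edges:
        if (src, dst) in back:
            continue
        succs[src].append(dst)
        pending[dst] += 1

    level = {n: 0 for n in nodes}
    queue = deque(n for n in nodes if pending[n] == 0)
    visited = 0
    while queue:
        n = queue.popleft()
        visited += 1
        for d in succs[n]:
            # A consumer must sit at least one level below each producer.
            level[d] = max(level[d], level[n] + 1)
            pending[d] -= 1
            if pending[d] == 0:
                queue.append(d)
    if visited != len(nodes):
        raise ValueError("graph still contains a cycle after removing loop edges")
    return level
```

With edges a->b, a->c, b->d, c->d and a loop edge d->a, the operators b and c share level 1 and d is placed at level 2.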


A second pass is now made to ensure that no two operators which are either a loop source or a loop target are at the same level. If two such operators share a level, the level is split until the condition no longer exists. The reason for this is to enable the use of the horizontal busses between PEs for connecting the source to the target.


In the next step of the mapping we connect the operators in the various levels. The outputs of level i have to be ordered so as to fit the inputs to level i+1.


After ordering the operands, (possibly by introducing 
extra levels which exchange operands), we connect the 
loop source with its target by using the horizontal connec- 
tions between PEs. We first route the operand from the 
source to the boundary of the current graph. Then, we 
route it to the level of the loop target. Finally, we use 
again the horizontal connections to reenter the graph up 
to the target operator. 


An example of the mapping of the factorial function from Figure 2 is shown in Figure 5(a). After the mapping process has been completed, reduction techniques may be applied to reduce the size of the final mapping. For example, two levels may be collapsed into one, taking advantage of unused horizontal connections. Such a reduction procedure, when applied to the example in Figure 5(a), results in the mapping shown in Figure 5(b).


The mapping algorithm described above is by no 
means optimal and the only purpose it serves is to show 
the feasibility of mapping arbitrary data flow graphs onto 
hexagonal arrays. More efficient mapping algorithms for 
hexagonal arrays as well as for other array topologies are 
clearly needed. 


Once the mapping algorithm is completed, we have to convey its results toward every relevant PE in the array. A simple way of doing this is to input a "configuration string", each component of which is addressed to a specific PE and contains the setup information for it. Another possibility that is being investigated is to execute the mapping algorithm within the array itself in a distributed fashion. This may enable a dynamic mapping, allowing PEs which have completed their current processing task to go into a configuring phase, change their function and execute another part of the data flow graph.


Such a dynamic mechanism may allow the mapping of 
larger data flow graphs on a given VLSI chip. It will also 
facilitate the handling of faulty PEs and/or connections. 


6. CONCLUSIONS 


The idea of directly mapping an arbitrary algorithm 
on a VLSI array has been shown to be feasible. However, 
further research has to be carried out before the 
effectiveness and practicality of this approach are esta- 
blished. 


Clearly, not all algorithms that can be mapped onto an array will use it effectively. Some algorithms may require too large a chip area; others may not execute fast enough. Consequently, methods have to be developed for estimating the chip area that will be used by a given algorithm, and its execution time.


Fig. 2: The data flow graph for the factorial function.

Fig. 5: Initial and reduced mappings of the factorial function.

repeat $x_{n+1} = \frac{1}{2}\left(x_n + \frac{c}{x_n}\right)$
until $|x_{n+1} - x_n| < \varepsilon$

Fig. 4: The Newton method.


ABSTRACT: This paper deals with the problem of associating the operations of a program with the processors in a distributed dataflow environment. The objective is to develop an algorithm which will partition a given dataflow program and map the blocks of the partition onto the processing elements of a general dataflow multiprocessing system. System performance is measured by the total processing time of the program.


INTRODUCTION 


The total execution time of a program can be bro- 
ken down into the actual computation time and the time 
for interprocessor data communication. When a pro- 
gram is partitioned among the processors of a dataflow 
system, the communication time is dependent on 
several factors such as the interprocessor connection 
network and the allocation scheme which maps the pro- 
gram operations onto the processing element. 


Different restricted versions of program decomposition and processor allocation schemes for dataflow programs have been proposed [2], [3], [4], [5]. These proposed algorithms, however, are all limited in several respects which make them inapplicable to an actual dataflow system. These deficiencies are attributable to one or more assumptions or restrictions, such as:

(a) The communication cost is negligible or constant,
(b) A uniform computation time for all operations,
(c) A limited set of common language constructs,
(d) Unlimited system resources such as the number of processors and the memory sizes, and
(e) Suppressing concurrency at lower levels.


Our research attempts to develop a mapping 
scheme which provides a reasonably good solution to the 
main problem, and at the same time avoids the above 
mentioned restrictions. Our eventual goal is to establish 
algorithms which are implementable for a large class of 
realistic dataflow machines. 


THE PROCESSOR ALLOCATION PROBLEM AND 
SOLUTION STRATEGIES 


The general architecture studied is that of a 
machine consisting of a large number of small asynchro- 
nously operating processing elements which can com- 
municate with one another. Each processor in the sys- 
tem is capable of accepting a task generated by the pro- 
gram, performing the indicated operations, producing 
partial results, and transmitting these results to the 
other processing elements in the system. Figures 1 and 2 show the general architecture of the dataflow multiprocessor system.


Dataflow programs are directly translatable into directed graphs because these structures are intuitively simple to understand and directly convey the program semantics. Such a program is partitioned into parts which are distributed to and stored in the processing elements. The intermediate results of these partitions of the program are retained in the local memory. Since the amount of communication may vary considerably from one pair of nodes to another, the way in which the nodes of the dataflow graph are allocated to processors can change the overall processing time of the program dramatically.

Fig. 2 A Processing Element


Minimizing the interprocessor communication time 
and balancing processor loads are the two factors that 
influence the program partition strategy for optimal 
system performance. It is clear that they are also two 
conflicting factors. While the total execution time can 
be minimized by distributing the dataflow nodes on all 
the available processors, the total communication time 
is minimized by concentrating the nodes in as few pro- 
cessors as possible. 


The algorithm we propose maps the program nodes onto the processing elements by simulating a run time allocation environment during the compile phase. There are many criteria which may be used to decide how a node is to be allocated to a processor. Some of these are: the length of time an enabled node has been waiting for a processor; the number of output edges of the node; the length of time required for the primitive operation represented by the node; the position of the node in the graph (as represented, for example, by the length of its longest output path); the amount of data generated by the node (which is directly related to the communication cost); or the number of data items at the input of the node. Besides these, there are other more complex criteria which may lead to more accurate estimation of the execution priority of each node. We have not considered them because we want the algorithm to have reasonable complexity.


SUMMARY OF THE ALGORITHM 


The dataflow program to be partitioned is translated into a dataflow graph whose nodes represent the operators and whose edges correspond to the data dependencies among the instructions. Every edge is regarded as a channel through which data items carry values from the output of a node to an input of another node. In order that an instruction be enabled as soon as its operands are available, the data dependencies of a dataflow graph must always be exactly the same as the sequencing constraints. Hence, applicative languages that obey the single assignment rule have been the basis of most of the proposed dataflow languages [1]. In this paper, we too are assuming the use of such a language. Since the algorithm attempts to simulate a run-time status of the program, any node that is enabled for execution is also enabled for allocation. The following information is required by the algorithm and hence must be known in advance:


(1) The number of available processing elements, and the processor type of each.

(2) A table of interprocessor communication (IPC) times. This is necessary in order to take the cost of communication between processors into consideration. The IPC reflects the topology of the interprocessor communication network.

(3) The expected execution time of each instruction. Each of these values is a constant throughout the program execution. The only exception may be the I/O operations, which are non-deterministic in nature. The execution time of an I/O operation is modelled by a probability distribution function.

(4) Each individual procedure, iterative loop, and block structure has a unique code block name which is associated with every node within it so that nodes belonging to the same code block can be identified.


The mapping of the nodes of a dataflow graph onto 
the memories of the processing elements is accom- 
plished in a single pass through the program graph. The 
algorithm maintains several pieces of data: 


(1) For each node $n_i$, there is a token counter, TC, which indicates the number of tokens still needed before the node is enabled. Initially, $TC(n_i)$ is set equal to the indegree of $n_i$. This TC is decremented every time a token is supplied to the node through an edge that has no token on it.

(2) For each node $n_i$, there is an enabled time $t_e(n_i)$. When the node is put in the set of firable nodes, this time is used to determine the order in which a node can be allocated to a processor. When the node is not yet in the set of firable nodes, $t_e(n_i)$ is updated each time $TC(n_i)$ is decremented to a value equal to the time a token is supplied to the node. Initially, $t_e(n_i) = 0$.

(3) There is a set of firable nodes, F, whose elements are ordered pairs $(n_i, t_e(n_i))$. A node is inserted in this set if and only if its corresponding TC has a value of 0.

(4) For each processor, there is a ready time, $t_r$. The value of $t_r(p_j)$ indicates the time at which the processor $p_j$ will be idle next. Initially, $t_r(p_j) = 0$.


The algorithm can be analyzed by considering the information available at each node as it is enabled for allocation. Since each code block has its name, nodes within the block can be uniquely identified. The algorithm starts by traversing the graph from its outermost block. As soon as a node (say, $n_i$) is enabled, which will be indicated by its token counter $TC(n_i)$ going to zero, that node is assigned to the "best suited" processor. The question being asked at this point is: which processor can start the execution of the node $n_i$ earliest if it were to be allocated to this processor? The answer involves checking the next ready time $t_r$ of each processor while taking into consideration the communication time between the target processor and the processors to which the predecessor nodes of $n_i$ have been assigned. This allocation step is completed by updating the system status change caused by the allocation.
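The allocation step for a single code block can be sketched as follows. This is a simplified illustration, not the authors' implementation: it handles one flat block (no nested blocks, loops or conditionals), takes an optional tie-breaking priority per node, and keeps the enabled time as the latest token arrival. The names `allocate`, `exec_time`, `ipc` and `priority` are hypothetical.

```python
import heapq


def allocate(exec_time, edges, ipc, priority=None):
    """Map the nodes of one acyclic code block onto processors.

    exec_time: {node: execution time}; edges: (producer, consumer) pairs;
    ipc[p][q]: communication time from processor p to processor q;
    priority: optional {node: value}, larger values win ties.
    Returns (processor of each node, start times, finish times).
    """
    nodes = list(exec_time)
    order = {n: i for i, n in enumerate(nodes)}
    priority = priority or {}
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for s, d in edges:
        succs[s].append(d)
        preds[d].append(s)

    tc = {n: len(preds[n]) for n in nodes}   # token counters TC(n)
    t_e = {n: 0 for n in nodes}              # enabled times
    t_r = [0] * len(ipc)                     # processor ready times
    proc, start, finish = {}, {}, {}

    # Firable set, ordered by enabled time, then priority, then input order.
    firable = [(0, -priority.get(n, 0), order[n], n) for n in nodes if tc[n] == 0]
    heapq.heapify(firable)
    while firable:
        _, _, _, n = heapq.heappop(firable)
        best_p, best_start = None, None
        for p in range(len(t_r)):
            # Earliest start on p: processor idle and all operands arrived.
            ready = t_r[p]
            for m in preds[n]:
                ready = max(ready, finish[m] + ipc[proc[m]][p])
            if best_start is None or ready < best_start:
                best_p, best_start = p, ready
        proc[n], start[n] = best_p, best_start
        finish[n] = best_start + exec_time[n]
        t_r[best_p] = finish[n]              # update the system status
        for d in succs[n]:
            tc[d] -= 1
            t_e[d] = max(t_e[d], finish[n])
            if tc[d] == 0:
                heapq.heappush(firable, (t_e[d], -priority.get(d, 0), order[d], d))
    return proc, start, finish
```

On a diamond-shaped graph with two processors and unit communication cost, the sketch keeps the first branch on the producer's processor and moves the second branch to the idle processor, since paying the communication delay still gives an earlier start.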


When a node with a different code block name is 
encountered, the algorithm is applied recursively to 
allocate this inner block. This recursive step eventually 
ends in the innermost block and completes the alloca- 
tion as described above. The resulting schedule is 
passed back to the embedding block. Processors are 
then assigned to these schedules while preserving their 
relative start times. 


In order to handle control structures such as a procedure call, a conditional construct or an iterative construct, special procedures are developed to reflect the effects of these various structures. These procedures, as well as the criteria used for tie-breaking when more than one node is available for allocation, are summarized as follows:


Note 1: Iterative constructs: the loop body is allocated as described above to a set of processors. The resulting schedule of each of these processors is treated as if it were the schedule of a single "node". The starting times of these "nodes" are compared to obtain the earliest starting time among them. Similarly, the latest finishing time can be determined. The difference between these two time values is used as the execution time of each of these "nodes" and is multiplied by the expected number of iterations. This parameter is the final schedule for these processors. In the case when the number of iterations is a run time variable, a value is obtained from a probability distribution function.


Note 2: Conditional constructs: After the allocation of each block that corresponds to a condition, and for each processor which has been utilized in more than one of these blocks, the schedules of each block for this processor are compared and the one representing the "worst case" execution time is used. This step is necessary in order to guarantee that sufficient processing time is reserved regardless of the condition selected during run time.


Note 3: Procedure Call: The procedure, which is 
essentially a dataflow graph itself, is allocated as a code 
block. The algorithm maps the variables used and 
created in the procedure to those passed and returned 
in the procedure call. 


Note 4: The tie-breaking criterion used when more than one node is simultaneously enabled for allocation is the Length of the Longest Output Path (LLOP). Each node contributes a value of 1 to the length of the path in which it occurs except for the procedure node which contributes a value equal to the length of the longest path in the graph of the procedure called by the node. Clearly, since the program graph allows iterative constructs, which may be executed a number of times, the length of the longest output path is not well defined. In this case, the longest output path obtained by traversing the loop once is multiplied by the expected number of iterations of the loop. Another criterion which has been considered is the largest outdegree of the successors of a node (LOD). It is used as a guideline because a node with a larger outdegree has a greater potential for enabling the firing of other nodes. As we shall discuss in the next section, results obtained by simulation indicate that LLOP yields a better performance than LOD in most cases. Therefore, preference is given to LLOP as the tie-breaking criterion in our algorithm.
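For an acyclic block, the LLOP of every node can be computed with a memoized traversal, as in the sketch below. Two assumptions are made that the text leaves open: the node itself is counted in its own path length, and loops are assumed to have been folded into single weighted nodes beforehand. The `weight` argument is a hypothetical way to supply the contribution of procedure nodes.

```python
def llop(succs, weight=None):
    """Length of the Longest Output Path for each node of an acyclic graph.

    succs: {node: list of successor nodes}, with every node present as a key.
    weight: optional {node: contribution}; nodes not listed contribute 1.
    """
    weight = weight or {}
    memo = {}

    def longest(n):
        if n not in memo:
            # Longest path through any successor, or 0 at an output node.
            tail = max((longest(s) for s in succs.get(n, ())), default=0)
            memo[n] = weight.get(n, 1) + tail
        return memo[n]

    return {n: longest(n) for n in succs}
```

Among simultaneously enabled nodes, the one with the largest value would be allocated first, since it heads the longest chain of remaining work.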


RESULTS AND COMPARISONS 


In order to evaluate the performance of the algorithm, simulation runs were made on data consisting of randomly generated program graphs. Programs with 10 to 100 nodes and 1 to 5 levels of code blocks in a 3 to 8 processor system were simulated. The number of nodes in the graphs, the interprocessor communication costs, the individual node execution costs, and the number of active processors were all generated from uniform random distributions. The table below shows the distribution of the ratio R of execution time for the algorithm using two different "tie-breaking" measures (LLOP and LOD) to
the dgecae execution time. 


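[Table: distribution of simulation results, giving for LLOP and LOD the percentage of runs in each range of R; the ranges include 1 < R < 11/10 and 11/10 < R < 5/4. The individual entries are illegible.]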


The most important property of the algorithm is its 
effect on the execution time of the program graph. A 
good algorithm will attempt to minimize the execution 
time regardless of the number of available processors. 
Figures 3 and 4 show the total execution time of two different dataflow programs as a function of the number of processors for our algorithm and for the ones reported in [4]. The two programs chosen are a trapezoidal quadrature program and a matrix multiplication program. The two approaches of the algorithm in [4] are identified as MLC1 and MLC2. The two versions of our algorithm are identified as LLOP and LOD. The best result is obtained
using the LLOP version of the proposed algorithm in both programs. While the algorithm MLC2 gives a shorter execution time than the LOD version of our algorithm when there is a small number of processors, this advantage disappears as the number of processors increases.


Fig. 3 Algorithm Performance with the Trapezoidal Program (execution cycles versus number of processors)

Fig. 4 Algorithm Performance with the Matrix Multiplication Program (execution cycles versus number of processors)


The effect is even more pronounced in the matrix 
multiplication program. The reason for this can be 
attributed to the fact that each node in both MLC1 and 
MLC2 algorithms represents a complete statement and 
hence the inherent parallelism of the statement is not 
taken advantage of. 


CONCLUSION 


The problem of decomposing and mapping a 
dataflow program graph onto a set of processing ele- 
ments is discussed. An algorithm has been developed 
for achieving efficient program execution while main- 
taining a high degree of concurrency. It is summarized 
in this paper. Research is proceeding to investigate the effect on the algorithm of placing limitations on several system resources. These include a finite memory size for each processing element, a communication network that is not totally connected, and a set of nonhomogeneous processors each of which is characterized by the functions it performs.


Abstract


Dataflow processors show much promise for high-speed computation at reasonable cost, but they are not without problems. This short paper will discuss a processor design which combines ideas from dynamic dataflow architecture with those from reduced instruction set computers and proven large computers with parallel internal structures. The resulting processor includes a number of innovations, including operand destinations, killer tokens, I/O streams and closed-loop computation, which result in a small, relatively inexpensive processor capable of high-speed computation. The expected application areas of the processor include interactive computer graphics, signal processing, and artificial intelligence.


1. Introduction 


Many new architectural proposals seem to assume the need for a new architecture. As a result, these designs end up as "solutions in search of a problem." This design started from the opposite viewpoint. We already had a problem -- the need for quick, easy to program interactive graphics. Its characteristics are common: the demand for fast computation at reasonable cost; use of algorithms that not only offer high degrees of parallelism, but are difficult to code sequentially; and a good fit into a message passing paradigm. The solution will involve some sort of parallel processor.


Note that the expected application area was used to guide the design 
decisions, not to cripple the machine into special purpose oblivion. 
An example of this sort of decision is the need to keep the processor
as simple as possible; most graphics applications simply cannot afford 
a large, expensive processor. Another design decision was to avoid 
global bottlenecks such as centralized schedulers. As the design 
progressed it was decided to ignore questions of purity; issues such as 
whether it is possible to construct indeterminate graphs are best dealt 
with at the language level. 


2. Dynamic dataflow 


The architecture chosen was the dynamic dataflow architecture [Tr82] first proposed by Arvind. The dataflow machine running at Manchester is an example of this type of architecture; however, there are still a number of problems to be solved [Ga82]. These problems can be broken down into three classes: problems with all parallel computers; problems with dataflow computers; and problems with current dynamic dataflow designs.


An example of the first type of problem is processor scheduling. On any multiple processor machine individual instructions must be scheduled to specific processing elements for execution. If this scheduling is done at any time other than run time there is a danger that some elements will be idle while others are swamped. In order to solve this problem the scheduling must be done at run time, but this requires global knowledge of the dynamic state of the system. Any global resource scheduler, however, will become a bottleneck on a multiple processor machine.

An example of the second type of problem is the handling of data structures. The data storage of a conventional computer is arranged as a large array, so arrays have become the dominant data structure used in computer science. In order to implement randomly updatable arrays on a dataflow machine, either the entire array must be passed around as the value of a token (i.e., array copying), or else it must be stored as a tree with the accompanying overhead on access. Dataflow machines are able to handle alternative data structures such as streams. It may be that our heavy dependence on arrays is an artifact of current practice, and other data structures might also be sufficiently powerful and efficient.


Any new type of computer will have the third type of problem. For 
example, early von Neumann computers did not have index registers. 
This meant that address manipulation had to be done with self 
modifying code and other tricks, until someone thought up a better 
way to do it. Most of the design innovations in this paper are aimed 
at the third class of problem, but the first two types of problems will 
also be addressed. These problems include the following: 


• Conditional instructions affect data flow and not control flow. All values circulating in a loop, even loop constants, must pass through branch instructions. This degrades execution time as well as code density.

• A further problem with dynamic dataflow designs is that all values circulating in a loop must have their tags updated.

• Conventional processors can take advantage of program locality by holding data and status in registers. The tradeoff is that this information must be saved during a context switch. Dataflow processors have no overhead on context switches, but as a result have difficulty utilizing program locality.

• Functional architectures such as dataflow have conceptual problems with nondeterminate and history sensitive operations, especially I/O operations.

• In many parallel processor designs there is much duplication of hardware, or hardware that cannot be utilized concurrently.

• Most proposed parallel processors perform poorly under low degrees of parallelism. One of the strengths of the Cray-1 was that it performed scalar operations reasonably well, not just vector computations.

• Some parallel processors depend upon massive amounts of parallelism, in the thousands or even millions, to realize their full performance. By studying the flow graphs from optimizing compilers we can see that most programs tend to have a parallelism of between 30 and 100 [TI80] [Fi82]. Impressive exceptions do exist, but they too will have bottlenecks where the parallelism is not quite as impressive. It is not even clear whether a computer with millions or even thousands of processing elements could be built. Certainly it would be difficult to keep such a system running for any length of time.

• In a parallel processor, if one process is producing values to be consumed by another process, the producer can get far ahead of the consumer and fill up the interprocessor queues unless some sort of handshaking is used between the processes or processors.

• As discussed earlier, there are problems with instruction scheduling and data structures.

3. Processor Structure


The processor is divided into a number of separate processing elements connected through a network. In order to avoid the duplication of hardware, and to keep the processing elements simple, different elements of the processor may have different instruction sets. Currently there are five types of processing elements (see below), although more could be added at any time. A processor can contain as many as 32 elements, but there must be at least one of each type.

Instructions can take either one or two operands, and those that take 
two require a matching buffer. These buffers are expensive, so on 
this processor the processing elements are divided into those whose 
instructions take one operand, and those that take two. This allows 
the matching buffers to be eliminated on elements that only execute 
monadic instructions, as well as the field in a token that indicates the 
number of operands required by its instruction. 


Each processing element then consists of two or three stages: an 
optional matching buffer, an instruction fetch stage, and an execution 
unit. The instruction fetch section fetches the appropriate instruction 
from a conventional memory, and the execution unit performs the 
desired operation, and updates the output tokens. 


4. Tokens 

[Figure: token format; the field widths shown are 32, 14, 1, 5, and 12 bits.]


Tokens consist of a DATA field and a tag, both of which are 32 bits 
long. The tag field is further divided into several fields: the IADDR 
field gives the address of the destination instruction; the ITER field is 
used to keep up to 32 active iterations of a loop separate; the INVN 
field is used to specify which invocation of a subroutine this token is 
from; and the P (or PORT) field tells two operand instructions 
whether this is the left or right argument, or for one operand 
instructions it can indicate the end of a stream. Matching buffers 
match tokens on all bits of the tag except for the PORT bit. 


A unique feature of this processor is the use of killer tokens to
remove tokens from the matching buffers. Killer tokens can be used
to increase parallelism in loops, perform elementary stream
operations, and form the basis for non-determinate computation.
For all this utility they are extremely easy to implement. A killer
token is a token that has the same tag as the token it is to kill, and
can be created by any instruction. If two tokens match in a matching
buffer and their PORT bits are the same, then both are discarded.
This situation corresponds to the illegal condition of having, say, two
right operands for the same instruction, and so does not require any
extra bits. Even so, it is very general, since killer tokens can arrive
before the token to be killed, and killer tokens can even kill other
killer tokens. The uses of killer tokens will be discussed later.
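
The matching rule just described can be sketched in a few lines of Python. This is a toy model, not the hardware design: the class `MatchingBuffer`, its `arrive` method, and the use of a dictionary keyed by the tag (without the PORT bit) are all assumptions made for illustration.

```python
class MatchingBuffer:
    """Toy matching buffer holding at most one waiting token per tag."""

    def __init__(self):
        self.waiting = {}  # match key (tag without PORT) -> (port, data)

    def arrive(self, key, port, data):
        """Return (left, right) operands if an instruction fires, else None."""
        if key not in self.waiting:
            self.waiting[key] = (port, data)  # first token waits for a partner
            return None
        w_port, w_data = self.waiting.pop(key)
        if w_port == port:
            return None                       # same PORT: both tokens discarded
        return (w_data, data) if w_port == 0 else (data, w_data)


buf = MatchingBuffer()

# Normal case: left and right operands meet and the instruction fires.
buf.arrive("A", 0, 10)
fired = buf.arrive("A", 1, 20)

# Killer arrives after its victim: both vanish, nothing fires.
buf.arrive("B", 1, 99)
killed_late = buf.arrive("B", 1, 0)

# Killer arrives first and waits; the victim is discarded on arrival.
buf.arrive("C", 0, 0)
killed_early = buf.arrive("C", 0, 5)
```

Because the kill is just the "two operands on the same port" case of an ordinary match, the model needs no extra state to distinguish killer tokens from normal ones.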


5. Instructions 


An instruction consists of a 32 bit operation phrase (opcode plus 
literals) followed by one or more 32 bit destination phrases. The 
ability to have an arbitrary number of destinations eliminates the 
need for explicit copy instructions, as well as the waste when space is 
provided for a fixed number of destinations and fewer are needed. 


Destination phrases on this processor are different from those in other 
dataflow designs in that they include fields to control data branching, 
token relabeling, output data source, and token priority, in addition 
to the fields containing routing information for the new token. A 
destination phrase consists of the following fields: 


field   bits  contents

LAST      1   Since there can be an arbitrary number of destination
              phrases, the last destination has this bit set; otherwise
              it is cleared.

SOURCE    2   The source of the DATA field for the outgoing token
              can be either of up to two results of an operation
              (i.e., low order or high order product for
              multiplication), or either of the operands. For single
              operand instructions, the data source can be the
              single operand, or one of up to three results.

SENSE     2   An operation may have a condition associated with
              it. This field tells the processing element whether to
              send this token out: regardless of the condition; only
              if the condition is true; only if the condition is false;
              or only if the processor is in debug mode.

INCR      5   Tells how much to increment the ITER field of the
              token before sending it out. This is used to send
              values from one iteration of a loop to another.

SMODE     1   If this bit is clear, the addition to the ITER field by
              the INCR field is done modulo 32. If set, overflow
              of the ITER field increments the INVN field.

PE        5   Processing element number to send this token to.

IADDR    14   New destination instruction address.

PORT      1   New PORT bit. Invert for a killer token.

PRIO      1   If set, then this token has priority in the
              communication network.


6. The Processing Elements


6.1 Arithmetic Processing Element. The arithmetic processing
element does simple arithmetic and logical operations such as adds,
exclusive ors, comparisons, and shifts. Combined with the SENSE
field these instructions can be fancy loop closers in that arithmetic,
test, and branch operations are combined in the same instruction.


6.2 Multiply and Divide Processing Element. This processing
element is based on a monolithic multiplication chip, with a
standard iterative algorithm used for division.


6.3 Constant Element. Constants too big to fit in the operation
phrase of the arithmetic element, as well as constants for other
operations, must be generated explicitly by this monadic processing
element. It also has the ability to do lookups into tables of constants,
which can be used for sine and cosine tables, for example. This
element can also be used to watch for the last token of a stream.


6.4 Context Element. The context element executes instructions
which manipulate the INVN field of a token. It is thus used for
subroutine calls and returns but, unlike conventional machines,
these instructions are not associated with any control flow; they are
associated only with the parameters passed and returned.


6.5 Input/Output Element. The interface between the history
sensitive outside world and this applicative processor is done by the
I/O element using streams. The I/O processor contains a block of
random access memory called the buffer memory. This buffer
memory is mapped into the address space of the host, or if there is no
host, it can be used as I/O buffers for DMA I/O devices such as disks,
tapes, or frame buffers. Streams of data are written into and read
out of the buffer memory.


7. Operand Destinations and Conditions 


This processor uses the SOURCE and INCR fields to avoid passing
values circulating in a loop through branch or tag update instructions.
The INCR field updates the ITER field to send its destination to the
next iteration of a loop. Loop constants can be specified as operand
destinations using the SOURCE field to automatically send the
constant to the next iteration. Unfortunately, this constant will be
sent to the next iteration regardless of whether that iteration will ever
occur, so there will be a token left over. A loop must have at least
one branch instruction, and it can be used to generate a killer token
for the leftover constant. As a result, only a single branch instruction
is needed for any loop, just as in conventional architectures. Using
the SENSE field, very compact loops can result. For example, an
iterative factorial program on this processor is two instructions, a
multiply and a decrement and branch.
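
To make the interplay of the SENSE, INCR, and SMODE fields concrete, the following Python sketch applies one destination phrase to an incoming tag. It is a minimal model under stated assumptions: the numeric encodings of the four SENSE values, the names `Tag`, `Destination`, and `emit`, and the behaviour of ITER wrapping to zero when it carries into INVN are all choices made for this example; overflow of INVN itself is not modeled.

```python
from dataclasses import dataclass

ITER_MOD = 32  # ITER is a 5-bit field

# Assumed encodings for the four SENSE options named in the text.
ALWAYS, IF_TRUE, IF_FALSE, DEBUG = range(4)


@dataclass(frozen=True)
class Tag:
    iaddr: int
    iter_: int
    invn: int
    port: int


@dataclass(frozen=True)
class Destination:
    iaddr: int
    port: int
    incr: int = 0       # amount added to ITER
    smode: int = 0      # 0: ITER wraps modulo 32; 1: overflow bumps INVN
    sense: int = ALWAYS


def emit(dest, in_tag, condition, debug=False):
    """Return the outgoing tag for one destination, or None if suppressed."""
    if dest.sense == IF_TRUE and not condition:
        return None
    if dest.sense == IF_FALSE and condition:
        return None
    if dest.sense == DEBUG and not debug:
        return None
    total = in_tag.iter_ + dest.incr
    invn = in_tag.invn
    if dest.smode and total >= ITER_MOD:
        invn += 1  # overflow of ITER increments INVN
    return Tag(dest.iaddr, total % ITER_MOD, invn, dest.port)


# A "decrement and branch" style destination: send the value to the next
# iteration only while the loop condition holds.
loop_back = Destination(iaddr=4, port=0, incr=1, sense=IF_TRUE)
wrapped = emit(loop_back, Tag(4, 31, 2, 0), condition=True)
stopped = emit(loop_back, Tag(4, 5, 2, 0), condition=False)

carry = Destination(iaddr=4, port=0, incr=1, smode=1, sense=IF_TRUE)
carried = emit(carry, Tag(4, 31, 2, 0), condition=True)
```

With SMODE clear the iteration number wraps from 31 to 0 within the same invocation; with SMODE set the same step also advances INVN.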


8. Streams


One of the more popular features of the UNIX operating system
[Ri78] is the ability to make a connection between the input and
output of two concurrently running processes using a mechanism
called a pipe. This allows concurrent computation, although on a
uniprocessor the concurrency is simulated using an I/O buffer and
multiprogramming. Streams on this processor perform a similar
function, but since there is no extra overhead associated with process
switching, processes can be made much smaller, and streams become
much more pervasive. Even loops can be considered as processes
operating on streams of data, and so streams become the major data
structuring facility of this processor.


Streams can also be used to do I/O concurrently with processing. In 
order to keep read operations from generating too many tokens and 
overflowing the ITER field, all operations are done on a closed-loop 
basis. Write operations generate a dummy result token that can be 
sent to a read operation to enable the next read. This scheme works 
in much the same way as acknowledgement tokens on static dataflow 
processors, except that they are explicit in the program, and are used 
around entire computations, not for every instruction. 


Large arrays of data, such as pixel arrays in a frame buffer, can be 
read and written using streams. To do reads from a frame buffer, a 
stream of pixel addresses is sent to the frame buffer, and a stream of 
pixel values is returned. If read and write operations to the frame 
buffer are mixed, then it is up to the user (or the compiler) to ensure
that there are explicit data dependencies to keep proper order to the 
operations. Since all operations are done on a closed-loop basis, this 
is fairly easy to guarantee. This method can be used for any array of 
data, not just pixels, using the buffer memory in the I/O processor. 


9. Resource Control 


Functional architectures usually have problems with history sensitive 
and non-determinate computations. Most history sensitive 
computations can be handled using streams. For example, if a 
resource is being requested by a stream of incoming requests, but
only one of them is allowed access at a time, the requests can be 
made to queue up in a matching buffer of a two operand element. 
When the present request is finished using the resource, the requisite 
matching token for the next request is sent to the matching buffer. 


This mechanism is not always enough; consider the situation where a 
request for a resource has been made, but the process requesting the 
resource changes its mind and wants to abort the request. This is 
done using killer tokens. The process sends two tokens containing 
null requests to the resource request queue. If the request has already 
been granted access to the resource, then the two tokens will act as 
killer tokens and kill each other. If not, the first token to arrive will 
act as a killer token for the original request, and the second will wait 
for the request acknowledgement token, and then perform the null 
operation. In either case, the resource will correctly enable the next 
request, and will send back some result (either the original one or a 
null), which the requesting process can ignore. 


10. Program Locality and Parallelism


The PRIO bit in a destination is used to indicate computations that
are on the critical path. This is used to give a token priority in the
communications network. The matching buffers also contain a small
set of high-speed registers, much like registers on a conventional
processor. The PRIO bit is used to indicate tokens which are given
priority in these registers. If a token enters the matching buffer stage
of a processing element and the instruction fetch section is free, then
an instruction prefetch can be initiated using the IADDR field of the
incoming token. If the match succeeds, then the instruction is
available immediately. These features improve performance under a
low degree of parallelism, and allow the processor to take advantage
of program locality. This results in a processor which does not
depend on large amounts of parallelism for impressive performance.


11. Supervisory Functions 


INVN value zero is reserved for supervisory functions such as loading
the instruction memories (there is also an I/O instruction to do this), 
clearing the matching buffers, and monitoring processing element 
load and utilization of matching buffer space. By monitoring system 
load, the operating system can shift instructions around when 
imbalances occur, while avoiding the global bottleneck of an 
instruction scheduler. The debug mode of the SENSE field can be 
used to count the number of times an instruction or loop is executed, 
print intermediate values, and other functions useful in debugging. 


12. Conclusions and Status 


This paper presents a dataflow design that has been evolving over the 
last several years [Le81]. It offers new solutions to many of the 
problems associated with dataflow processors. The design is currently 
running on a simulator which has been used to verify that the 
schemes proposed here actually work and can be used in small 
programs. To the best of my knowledge, the following features are 
unique to this processor: 


• Non-homogeneity of the processor. Increases modularity,
eliminates matching buffers in some elements, ALU’s in others,
simplifies control and instruction

• Operand destinations with automatic token relabeling. Eliminates
extra branch and relabeling instructions. Only explicit relabeling
instructions are for parameters passed to subroutines.

• Killer tokens. Clean up after operand destinations, perform stream
operations, help non-determinate computations.

• Loop closing instructions for fast execution.

• Ability to take advantage of program locality and give priority to
computations on the critical path.

• Instruction prefetching and matching buffer registers. Improves
performance when parallelism is low.

• Arbitrary number of destinations per instruction. Eliminates need
for copy instructions.

• Elimination of explicit branch instructions by associating a
condition with most operations.

• Ability to stream I/O directly from peripheral devices such as disks
and tapes with no host intervention.

• Computations are done on a closed-loop basis, which keeps
producer processes from racing too far ahead of their consumers.


In my opinion, the problems plaguing current dataflow designs will 
turn out to be no more severe than the problems that plagued early 
von Neumann computers. This paper proposed solutions to some of 
these problems, and solutions to other problems will undoubtedly be
discovered as more dataflow processors are built and experience with
them is gained. One way to help make sure that dataflow processors
will be built is to design them for application areas where traditional 
architectures are weak. This processor is aimed at such areas. 


Abstract


Classical signal processor design techniques are very expensive and
time consuming, and they result in custom hardware and software
that may not be capable of meeting the wide-ranging requirements
of signal processing applications other than the one for
which they were intended.


This paper presents an organization of a Programmable Modular 
Signal Processor. A data flow concept of control is advocated in 
order to take advantage of the run-time parallelism inherent in 
most applications. The attractive features of the system are that 
it is capable of being realized with a small number of hardware 
functional units, and it allows the hardware to be independent of 
the signal processing application. The system is organized to sup- 
port a high-level graph-oriented signal processing application de- 
scription capability in order to simplify the user interface. 


INTRODUCTION 


Many areas of scientific computation — aerodynamic simulation, 
weather forecasting, real-time radar signal processing — have 
immense computational requirements with instruction execution 
rates greater than 2 x 10° operations per second. Conventional 
computer structures are unable to meet the demand because they 
cannot exploit the high degree of parallelism exhibited by most 
applications. 


Considerable research has been done in parallel computer
structures and their suitability for applications exhibiting a high
degree of concurrency. The two basic structures are the parallel
array and the pipeline. The parallel arrays (array processors,
such as ILLIAC IV, and multiprocessors, such as C.mmp) have
various problems such as job partitioning and memory conflicts.
There is overwhelming evidence that only a fraction of
their potential can be realized for a broad spectrum of applications
[1,2,3]. The pipeline structures (IBM 360/91, CDC-STAR,
CRAY1), although capable of high performance, are prone to
extremely low performance when the pipeline flow is disrupted
by branching, non-availability of data, or interdependence between
stages.


The data flow concept promotes a fundamentally different way of 
executing instructions — an alternative to sequential program 
execution in conventional computer structures. In a data flow 
machine, an instruction is ready for execution when its operands 
are available. No control flow is indicated either implicitly or 
explicitly and there are no program counters. The concept of data- 
activated instruction execution allows multiple instructions in a 
data flow program to be executed concurrently. This expression of 
parallelism in terms of data dependencies, rather than in spite of 
them, leads to a far more natural and flexible picture of parallel 
program execution [4]. 


In view of the nature of the parallel hardware structures and the 
practical difficulties encountered in running them at full speed, it 
seems likely that the data flow architectures, with their 
fundamentally different approach to instruction execution, have 
a better chance of exploiting the high degree of parallelism inher- 
ent in most applications. 


In this paper a data flow architecture for a Programmable Modular
Signal Processor (PMSP) machine is presented. The PMSP
system is being designed for solving a broad range of real-time
signal processing applications, with particular emphasis on radar
signal processing. The architecture is developed by repeated use
of a small number of basic building blocks, which are:


1. A Multiport Memory System (MMS) that features concur- 
rent access to memory at its multiple data ports, high 
throughput per port, and a programmable interconnection 
mechanism. 


2. A General Processing Element (GPE) that has scalar/vector
arithmetic processing capability, matrix arithmetic, transforms,
and function generation capability.


3. An Input/Output Processing Element (IOPE) to support all 
the input/output interfaces. 


These building blocks can support multiprocessor systems that 
have very high throughput and which are reliable and dynami- 
cally reconfigurable. The MMS, GPE and IOPE are coupled with 
a data flow control mechanism to provide the PMSP architecture 
with characteristics needed to support real-time radar signal 
processing applications. 


DATA FLOW ARCHITECTURES AND PMSP SYSTEM 
ORGANIZATION


The objective of this section is to introduce the basic data flow 
architecture and identify the modifications needed to support a
wide range of real-time signal processing applications. 


A number of data flow architectures have been proposed [5-9]. We 
have chosen the Manchester data flow model proposed by Gurd 
and Watson [8] to illustrate the approach and the modifications 
needed. The Manchester data flow machine has five units inter- 
connected (see Figure 1). The node store contains node descrip- 
tions, which specify the operation to be performed, and the node 
store address to which the results of the node execution must be 
directed. Data values circulate around the ring as tokens that 
consist of a data value and the address in the node store. When a 
token or a pair of tokens arrive at the node store, the correspond- 
ing instruction (node description) is fetched from the node store. 
The node description (instruction) when coupled with its tokens 
(operands) becomes an executable instruction that is routed to the 
processing subsystem. 


The matching store is an associative storage mechanism in which 
tokens directed to the same node are grouped together. The token
queue is a buffer between the switch and the matching unit for 
equalizing disparities in the rate of production and consumption 
of tokens. The switch provides a means for communicating with 
the external world. 


The data flow mechanism may be viewed as a circular pipeline
that carries tokens. The degree of concurrency possible is limited
by two factors: 1) the number of processing elements in the processor
subsystem and the degree of pipelining within each unit,
and 2) the capacity of the data paths connecting various subsystems
in the ring.


[Figure 1 labels: TOKEN; EXECUTABLE INSTRUCTION; PROCESSING SUBSYSTEM; RESULT TOKENS; SWITCH]

Figure 1. Basic Manchester Data-flow Architecture


The suitability of a data flow model for radar signal processing
applications has been studied [10]. Two benefits are very attractive:
simultaneity in an irregular, run-time dependent data flow situation,
and the software engineering advantages when several complex
functions are to be executed in parallel [11]. A
radar signal processing application is typified by very high data
arrival rates, high computational rates on large volumes of data,
high reliability, and fault tolerance. In order to accommodate
radar signal processing applications in data flow, a number of
modifications are necessary.


The first modification proposed is a higher level of data flow 
(macro data flow), which would synchronize operations at the 
functional level. The functions may be complex tasks on large 
vectors of data. The data flow concept described earlier, in con- 
trast, synchronizes operations at the arithmetic level. This modi- 
fication can minimize the token flow traffic on the circular pipe- 
line of the data flow machine. 


Second, the tokens in the macro data flow model would consist of 
a pointer to a value or set of values and also an address of the 
node store, instead of the conventional tokens, which carry a 
label, a value, and the address of the node store. By making this 
change, it would be possible to avoid saturating the data paths 
with data tokens (operands) intended for a single instruction. The 
time taken to route an instruction from node store to a processor 
in the processing subsystem would be quite small in comparison.
Hence, the unused capacity of the data paths could be used to 
route more instructions to improve the concurrency to a level 
acceptable for real-time applications. 


Finally, the processing elements in the processing subsystem 
should be capable of executing a variety of complex tasks. The 
architecture of these processing elements, however, need not be 
data-flow oriented. Any of the conventional parallel structures — 
array, pipeline, or hybrid, for example — can be chosen in order to 
efficiently execute the tasks because the concurrency achieved by 
data flow at this level is offset by communications and control 
overhead. A complete discussion of data flow performance under 
various conditions is available in Reference [10]. 


PMSP Data Flow Processor (PDFP) Architecture 


The PMSP Data Flow Processor consists of five subsystems (see 
Figure 2): a Task Data Base, a Task Dispatcher, a Processor 
Subsystem, an Input/Output Subsystem, and a Task Scheduler. 
The five subsystems are interconnected in a ring; the I/O Subsystem
and the Processor Subsystem also interact with a multiport
memory system.


The task data base provides a central storage medium for various
tables describing graph nodes, node tasks, and the interconnectivity
of the nodes. The data base also maintains status and capability
lists of the various resources in the Processing and I/O
Subsystems. A Task Ready List indicating the order in which the
tasks became ready to execute is also maintained in the task data
base.


The run-time function of the task dispatcher is to select the next 
task to be run from the Task Ready List, to select a processor from 
the I/O Subsystem or the Processor Subsystem upon which to 
execute the task, and to dispatch the corresponding task packet to 
the selected processor. The packet consists of a task identification 
followed by parameter, input, and output parameter information. 
The task dispatcher broadcasts the task packet on the Dispatcher 
Bus (DBUS). The task dispatcher also interfaces with the host, for 
purposes of statically loading the task data base with an applica- 
tion, and runtime control of the database by the host. 


The Processor Subsystem consists of a number of General Proc- 
essing Elements (GPEs), as shown in Figure 2b. The Dispatcher 
may select one of the GPEs to receive a task packet over the 
DBUS. The GPE retrieves the input operands from the Multiport 
Memory System (MMS) from the locations indicated by the task 
packet, performs the specified operation, and returns the data 
(output operands) generated to the MMS. (The GPE may have one 
or more ports to the MMS.) The GPE also generates a result 
packet consisting of a GPE-Identification, Task, and Task-status. 
The result packet is available to the Task Scheduler on the 
Scheduler Bus (SBUS), which is time multiplexed. The task 
scheduler generates the address of the next device that may place 
its result packet on the SBUS. 


Each GPE is capable of executing a number of the most commonly
used signal processing algorithms or primitives, for example,
FFT, MTI, and CROSS-ADD. A library of these packages exists at
each GPE; there is no run-time loading of packages. It is possible
to have each GPE execute every signal processing primitive or to
have GPEs with different primitive execution capabilities. A capability
list of each GPE is maintained in the task data base.


The Input/Output Subsystem consists of several I/O Processing
Elements (IOPEs), each of which has at least two I/O ports, one for
communicating with the multiport memory system and the other
for communicating with devices external to the PDFP. A variety
of external devices, such as ADCs, DACs, or an IOPE of another PDFP,
may be attached to an IOPE. The dispatcher may select one of the
IOPEs to receive an I/O task packet over the DBUS. The selected
IOPE decodes the task and performs the specified data transfer
operation between the MMS and the external devices attached to
the IOPE. Upon completion of the I/O task, the IOPE generates a
result packet consisting of an IOPE identification, Task, and
Task-status. The result packet is available to the task scheduler
on the SBUS. Upon initialization of the PDFP, some of the IOPEs
are dedicated to bringing in external data and starting the data
flow application resident in the task data base.


The task scheduler receives the result packets from the GPEs and 
IOPEs in a time-multiplexed fashion. From the result packet and 
the application graph tables residing in the task data base, the 
scheduler determines which nodes are ready for execution and 
places them at the back of the Task Ready List. 
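
The division of labour between dispatcher and scheduler described above can be sketched as a small Python loop. This is a deliberately simplified, sequential model: the graph, the capability lists, and the names `GRAPH`, `CAPABILITIES`, and `run` are hypothetical, a node is treated as ready once all its predecessors have finished (the real readiness test depends on the queue contents), and tasks run one at a time rather than concurrently on several processors.

```python
from collections import deque

# Hypothetical application graph: node -> (primitive, successor nodes).
GRAPH = {
    "in":  ("IO",  ["fft"]),
    "fft": ("FFT", ["mti"]),
    "mti": ("MTI", ["out"]),
    "out": ("IO",  []),
}

# Hypothetical capability lists, as would be kept in the task data base.
CAPABILITIES = {"GPE0": {"FFT"}, "GPE1": {"FFT", "MTI"}, "IOPE0": {"IO"}}


def run(graph, capabilities, start):
    """Return the (node, processor) pairs in the order they are dispatched."""
    pending = {node: 0 for node in graph}  # unfinished predecessors per node
    for _, successors in graph.values():
        for s in successors:
            pending[s] += 1

    ready = deque(start)                   # Task Ready List (FIFO)
    trace = []
    while ready:
        node = ready.popleft()             # dispatcher: next ready task
        primitive, successors = graph[node]
        # Dispatcher: first processor whose capability list has the primitive
        # (raises StopIteration if none does; not handled in this sketch).
        proc = next(p for p, caps in sorted(capabilities.items())
                    if primitive in caps)
        trace.append((node, proc))         # task packet sent over the DBUS
        # Result packet comes back on the SBUS; the scheduler marks
        # successors and appends newly ready nodes to the back of the list.
        for s in successors:
            pending[s] -= 1
            if pending[s] == 0:
                ready.append(s)
    return trace


trace = run(GRAPH, CAPABILITIES, ["in"])
```

In this toy run the MTI task goes to `GPE1` because `GPE0` does not list that primitive, which is the role the capability lists play in the text.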


Figure 2a. PMSP Data Flow Processor Organization

Figure 2b. PMSP Data Flow Processor Organizational Details


The Multiport Memory System supports contention-free parallel 
access to the bulk memory at its multiple ports. The MMS pro- 
vides a programmable interconnection mechanism by supporting 
a set of high-level I/O commands at each port (see Table I) to 
interconnect bulk memory and an external device or any two 
external devices. The high speeds attainable at each port and 
programmable interconnection capability make MMS a useful 
tool in building reliable and adaptive multiprocessing systems for 
real-time applications. 




The bulk memory used in MMS is block oriented: the smallest 
unit of data transferred between the MMS and an external device 
attached to it is one block. A block is transferred word-serially. 
The block size varies linearly with the MMS configuration; that 
is, an MMS with 16 ports has a block size of 16 words. The block 
size is not considered a limitation because the class of applications
addressed by the PDFP requires large volumes of data;
therefore, an optimized block transfer approach, such as the one 
used by MMS, is preferred over other approaches. Details of MMS 
operation are beyond the scope of this paper and are available 
elsewhere [12]. 


[Figure 3 labels: HOST BUS (HBUS); INTERCONNECTION MECHANISM: use MMS for application independence, custom interconnection for small systems]

Figure 3. PMSP Data Flow Multiprocessor Organization


TABLE I. MMS COMMAND SET

OPERATION  DEVICE-ADDR          BLOCK-COUNT  REMARKS

READ       BULK-MEMORY ADDRESS  # BLOCKS     READ SPECIFIED # OF BLOCKS FROM BULK
                                             MEMORY AND TRANSFER TO EXTERNAL DEVICE

WRITE                           # BLOCKS     WRITE SPECIFIED # OF BLOCKS TO
                                             BULK-MEMORY. THE SOURCE OF THESE
                                             BLOCKS IS THE EXTERNAL DEVICE

SEND       PORT-ADDRESS         # BLOCKS

RECEIVE    PORT-ADDRESS         # BLOCKS


PMSP Data Flow Multiprocessor (PDFMP) Organization


The level of concurrency exploited by the PDFP may be increased
enormously by connecting several PDFPs to form a multiprocessor
system. The organization of a PMSP Data Flow Multiprocessor
is illustrated in Figure 3. The PDFPs communicate with each
other by way of their respective IOPEs. The interconnection of the
PDFPs may be made independent of application by using an
MMS. For a small number of PDFPs, a direct interconnection of
their respective IOPEs may be preferred.


The PDFPs are connected to the host by means of a host bus 
(HBUS). The user may use the graph description capabilities ex- 
isting on the host to describe the various subgraphs that make up 
the application. The host assigns one subgraph to each PDFP and 
exercises overall control of the PDFPs. 


PROGRAMMING SUPPORT AND OPERATING SYSTEM 
REQUIREMENTS 


An application that can run on PDFP or PDFMP may be described
as a data flow graph (see Figure 4). The nodes are the
operators and the arcs are data queues. Operators may be large
tasks, such as a 1024-point FFT or a matrix operation. The data
queues are FIFO queues maintained in the Multiport Memory
System. Each queue has a read amount (RA), a consume amount
(CA), and a produce amount (PA). The produce amount determines
the amount of data a node produces and places on its output
queue. The read amount determines the amount of data
needed from the queue for a node to start execution. The consume
amount determines the exact amount of data consumed by the
node. When a node has multiple input queues, it can start execution
only when all the input queues have enough data on them to
equal or exceed their corresponding read amounts.


Figure 4. Example of Graphical Description of Application (legend: IQ = input queue; RA = read amount; CA = consume amount; PA = produce amount)
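
The firing rule for graph nodes can be illustrated with a short Python sketch. It assumes that a node sees the first RA items of each input queue and then removes CA of them, with CA no larger than RA, so that RA > CA gives overlapping windows; the paper does not spell out this detail. The names `Queue`, `ready`, and `fire` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Queue:
    read_amount: int     # RA: items that must be present before firing
    consume_amount: int  # CA: items actually removed when the node fires
    items: deque = field(default_factory=deque)


def ready(inputs):
    """A node may fire only when every input queue holds at least RA items."""
    return all(len(q.items) >= q.read_amount for q in inputs)


def fire(inputs, output, produce_amount, op):
    """Fire the node once if it is ready; return True if it fired."""
    if not ready(inputs):
        return False
    windows = [list(q.items)[:q.read_amount] for q in inputs]
    for q in inputs:
        for _ in range(q.consume_amount):  # assumes CA <= RA
            q.items.popleft()
    results = op(*windows)
    assert len(results) == produce_amount  # PA items go to the output queue
    output.items.extend(results)
    return True


# Example: a 3-point moving sum reads 3 items, consumes 1, produces 1.
src = Queue(read_amount=3, consume_amount=1, items=deque([1, 2, 3, 4]))
out = Queue(read_amount=1, consume_amount=1)
fired = [fire([src], out, 1, lambda w: [sum(w)]) for _ in range(3)]
```

The node fires twice and then stalls, since only two items remain on a queue whose read amount is three.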


The host system implements a PCOS interpreter [13], which ac- 
cepts graph descriptions as a set of directives and generates an 
object graph. An object graph is simply a collection of data sets 
indicating the various nodes in the graph and their connectivity 
as well as various queues in the graph and their characteristics 
(see Figure 5). 


Each GPE in the processing subsystem has an executive capable 
of decoding the task packet, fetching the input operands, and 
invoking the appropriate primitive routine to execute the task. 
Upon completion of the task it returns the output operands to the 
operand store and posts the result to the task scheduler. The most 
commonly encountered primitives may be coded and stored lo- 
cally in its program memory. 


In addition, it is possible to specify a combination of these primi- 
tives as a single task. For example, in the implementation of a 
radar MTI Process (Figure 6), a number of primitives (nodes) are 
interconnected as a subgraph to be executed by a GPE. Only the 
data produced by the peripheral nodes is transferred to the Multi- 
port Memory System, and the data produced by the intermediate 
nodes is retained in the GPE for use as inputs by the subsequent 
nodes in the subgraph. This technique significantly reduces I/O, 
promotes better GPE performance, and reduces the data queue 
build-ups in the Multiport Memory System. With these capabilities,
most of the complex tasks encountered in radar signal processing
can probably be specified directly by the user in the data
flow graph, thus avoiding conventional software development.


The task dispatcher has access to the object graph data sets (see 
Figure 5) and the task ready list from which it selects the task as 
well as the processor best suited to perform the task. The task 
scheduler accesses the object graph data sets to determine which 
nodes are ready to execute, and places them on the task ready list. 


CONCLUSIONS 


In this paper, the organization of a Programmable Modular Sig- 
nal processor that utilizes a data flow concept of control was pre- 
sented. Data flow architecture was chosen to take advantage of 
the parallelism inherent in most real-time applications. The at- 
tractive features of this approach include: 


a) Hardware that is independent of signal processing applica- 
tions 


b) Development of a relatively small number of hardware func- 
tional units 


c) Isolation of signal processing algorithm development from 
software development 


d) A mechanism for the development of a signal processing 
application comprised of a configurable set of standard hard- 
ware functional units 


e) A mechanism for merging data processing control functions 
written in a HOL at the host, and signal processing func- 
tions expressed in graph notation. 


The modifications to the conventional data flow architectures
enable very high performance data flow machines to be developed.
Having eliminated the data token traffic from the circular
pipeline, the dispatcher can dispatch the task packets to the
processors faster than normally possible. The scheduler can also
perform faster because it does not have to deal with a large number
of data tokens. Performance can be further improved by segmenting
the dispatching and scheduling operations and pipelining them.


Preliminary analysis of the architecture gives a strong indication
that it is suitable for real-time signal processing applications.
Detailed simulation modelling and performance evaluation under
various conditions for a variety of applications are in progress.


A SIMULATOR FOR MIMD PERFORMANCE PREDICTION --
APPLICATION TO THE S-1 MkIIa MULTIPROCESSOR(a)


T. S. Axelrod, P. F. Dubois, P. G. Eltgroth 
Theoretical Physics Division 
Mathematics and Statistics Division 
Lawrence Livermore National Laboratory 
Livermore, California 94550 


Abstract -- We describe a MIMD multiprocessor
simulator and application of that simulator to a
multiprocessor of current interest, the S-1
MkIIa. The simulator runs on the Cray-1 and is
designed so that computational physics benchmarks
are actually run and produce results. Simulator
output from this run is fed into a second level
(hardware) simulator which calculates the behavior
of the multiprocessor. The simulator can simulate
multiprocessors whose basic architecture is that
of a few, large processors with or without data
caches, sharing global memory through an
interconnection switch. The simulator is applied
to investigate the behavior of SIMPLE, a benchmark
physics code.

(a) Work performed under the auspices of the
U. S. Department of Energy by the Lawrence
Livermore National Laboratory under contract
No. W-7405-ENG-48.

I. Introduction

Our purpose in this paper is twofold. We
begin by describing a simulator we have created
for predicting the performance of realistic physics
calculations performed on multiprocessors. We then
describe the physics code SIMPLE and present
simulator results for its performance on a
multiprocessor of current interest, the S-1 MkIIa
[S-179]. The simulator is available for use on
Cray-1 computers under the CTSS operating system,
and a user's manual has been previously published
[Axe82].

II. Simulation Methodology

1. Goals of the Simulator

Simulation of computer systems is performed
for a great variety of purposes. Among these are
gate level simulations to verify logical
correctness of a design, register level simulations
which allow the effects of instruction sequences
to be determined, and queuing models, which are
most commonly employed to predict performance of
computer systems under timesharing loads. The
simulations described here fall in a less common
category. Our goal is to predict the performance
of an MIMD computer in solving large computational
physics problems by some specified algorithm.

Our simulation has an additional goal of
nearly equal importance. Since usable and
accessible MIMD machines are still very rare,
computational physicists, numerical analysts, and
programmers have little or no experience with the
"parallel world". Not only is a new set of
performance related issues present, but there are
new issues of logical correctness and choice of
language constructs. Our simulator, called MPSIM,
provides a readily available tool for gaining at
least some of this experience. One crucial feature
of the simulator is that the algorithm being
investigated is actually run in all its detail so
that its numerical behavior (e.g., stability,
correctness of answers) can be observed. Among
other things, this allows many bugs related to
multiprocessing to be found.

2. The Trace-driven Two Level Methodology

Simulation under MPSIM occurs at two levels
(MPSIM-1 and MPSIM-2). These two levels represent
the software and hardware, respectively, of the
integrated system being modelled. This two level
structure was inspired by the work of L. Cox
[Cox78]. At the first level of simulation we deal
with autonomous instruction streams referred to as
processes, without reference to any physical
[...] while at the second level we deal
with physical processors. At present we assume
that there is a one-to-one mapping between
processes and processors, but this assumption is
[...] extension of the simulator to include machines
such as the Denelcor HEP [Smi78], which has
multiple processes per processor, is at least in
principle straightforward.

The two levels of the simulation (which can
be run independently) are used for different
purposes. At the first level of simulation, we
are concerned primarily with the correct logical
operation of a program, and the details of a
particular multiprocessor implementation (such as
the relative speeds of private and shared memory,
the speeds of various synchronization operations,
and so on) may be mostly ignored. On the other
hand, the performance of a multiprocessor program
may depend critically on the details of the
implementation, and the second level of the
simulation, which is driven by output from the
first level, is intended to predict this
performance. In general, performance information
will dictate changes in the program, which are
incorporated by returning to the first level.

MPSIM-1 is an extended version of a
simulator written by L. Sloan. It runs on the
Cray-1 under CTSS and consists principally of a
process scheduler and the machinery and data
structures to save and restore process state
information. The simulator provides the FORTRAN
user with a number of subroutines which allow the
creation and destruction of processes and implement
a variety of synchronization operations. A brief
summary of the available functions is given in
Table 1. A complete description is available in
[Axe82].

Table 1. Summary of MPSIM-1 Subroutines.

Name      Purpose

MPINIT    Initialize MPSIM-1
TRCINIT   Begin trace output for MPSIM-2
FORK      Start an additional process
FORKOFF   Start multiple additional processes
SYNCAL    Global synchronization barrier
PSEM      P operation on a semaphore
VSEM      V operation on a semaphore
PAWS      Wake up the MPSIM-1 scheduler
JOINAL    Terminate all but 1 process
PRCEND    Terminate a process
TRCFIN    Terminate trace output for MPSIM-2
MPFINI    Terminate MPSIM-1 operation


In operation the Level 1 simulator is a
timesharing system in miniature. A single Cray-1
is multiplexed among the currently runnable
processes and a simulated wall-clock is
maintained. The level 1 simulator is in fact a
simulation of a particular MIMD machine--a highly
idealized multiple Cray-1 with both shared and
local memory.
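
The "timesharing system in miniature" idea can be illustrated with a
small sketch. This is not MPSIM-1 code. It is a hypothetical Python
illustration in which each simulated process is a generator that
yields the cycle cost of the work it has just done, and a round-robin
scheduler multiplexes one host among the runnable processes while
keeping a simulated clock for each. The names run_level1 and worker,
and the cost figures, are assumptions made for the example.

```python
from collections import deque

def worker(costs):
    """A simulated process: yields the cycle cost of each piece of work."""
    for c in costs:
        yield c

def run_level1(processes):
    """Multiplex one host among simulated processes (round robin).

    Returns (per-process simulated clocks, simulated wall-clock).
    On the idealized machine the processes run concurrently, so the
    wall-clock is the largest per-process clock, not the sum.
    """
    clocks = [0] * len(processes)
    ready = deque(enumerate(processes))
    while ready:
        pid, proc = ready.popleft()
        try:
            clocks[pid] += next(proc)   # run one slice of this process
        except StopIteration:
            continue                    # process terminated; drop it
        ready.append((pid, proc))       # still runnable; requeue it
    return clocks, max(clocks, default=0)

clocks, wall = run_level1([worker([5, 5, 5]), worker([10, 2]), worker([7])])
```

The host executes the slices one after another, but the simulated
wall-clock advances as if each process had its own processor.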

Clearly most of the characteristics of a 
parallel algorithm which will determine its 
performance on a real MIMD machine are contained 
in the details of its behavior on this idealized 
MIMD machine. Not only is the pattern of process 
creation, destruction, and communication present, 
but quite complete information is also available 
on the actual computation performed by each 
process, much of which is hardware independent. 
The linkage between the level 1 and level 2 
simulator consists of abstracting the performance
related information from the level 1 behavior and 
transferring it to level 2 where it may be 
interpreted in the context of a more realistic 
(and possibly quite different) MIMD machine. 

What should actually be included in the
information transmitted to level 2? One extreme
approach would be to include the complete state of
the simulated MP-Cray on every clock cycle. There
are two serious problems with this, however. The
quantity of information is several orders of
magnitude too large to permit the behavior of a
complete physics algorithm to be practically stored
or processed. Even more serious, perhaps, is the
fact that most of the information generated has no
obvious relevance to machines other than a Cray-1
or a very close relative.

At the opposite extreme, one might simply 
count arithmetic operations performed by each 
process. This information is so incomplete, 
however, that the hardware model contained in 
Level 2 must of necessity be quite simple. In
particular, details of interprocessor interactions 
which arise from memory conflicts and cache 
coherence problems cannot be modeled. 

In general the choice of which information 
should be abstracted from the Level 1 behavior is 
partially subjective and must be based on a 
careful assessment of the characteristics of the 
system being modelled, and the size of the 
computational resources available for the 
simulation. The nature of the choice we have made 
for modelling the S-1 MkIIa is discussed in 
Section III. For the moment we merely note that 
it falls between the two extremes, preserving the 
details of each process' data referencing patterns
and the arithmetic operations it performs, while
neglecting many details of address calculations 
and instruction referencing patterns. 

During the operation of MPSIM-1 the 
information needed by MPSIM-2 is gathered by 
machine instructions which have been automatically 
inserted into the user's object code. These 
instructions cause control to be temporarily 
transferred to simulator routines which write the 
desired machine state information to disc files, 
one for each process. This information gathering 
process is invisible to the user except for 
increased CPU charges. 

MPSIM-2 views the events contained in each 
process’ trace stream as requests for hardware 
services by a running process. Typically the 
services required are arithmetic operations and 
transfers of operands. Each request is satisfied 
as soon as possible within the constraints of the 
hardware model. We note that this may result in a
time ordering for events in separate process
streams that differs from that of MPSIM-1. Event
ordering within a single process stream is
preserved (unless the MPSIM-2 hardware model
allows out-of-order execution), as are the
interprocess orderings enforced by synchronization.

The output of MPSIM-2 is a detailed record of
event histories for each of the simulated
processors, which in practice are usually viewed
in the form of graphical plots.
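
The trace-driven replay described above can be sketched as follows.
This is a minimal illustration, not the MPSIM-2 implementation: the
event format ("op", "mem", "barrier"), the function name replay, and
the single-resource bank model are all assumptions chosen to keep the
example short. Each request is started as early as the hardware model
allows, order within a trace is preserved, and barriers enforce the
interprocess ordering.

```python
def replay(traces, n_banks):
    """Replay per-process traces against a toy hardware model.

    Events: ("op", cycles), ("mem", bank, cycles), ("barrier",).
    Every trace is assumed to contain the same number of barriers.
    Returns the finishing time of each simulated processor.
    """
    n = len(traces)
    clock = [0] * n             # simulated time of each processor
    pos = [0] * n               # index of the next event in each trace
    bank_free = [0] * n_banks   # time at which each memory bank is idle
    waiting = set()             # processors stopped at a barrier
    while True:
        runnable = [p for p in range(n)
                    if pos[p] < len(traces[p]) and p not in waiting]
        if not runnable:
            if not waiting:
                break
            # all active processors have arrived: release the barrier
            t = max(clock[p] for p in waiting)
            for p in waiting:
                clock[p] = t
                pos[p] += 1
            waiting.clear()
            continue
        p = min(runnable, key=lambda q: clock[q])   # earliest processor first
        ev = traces[p][pos[p]]
        if ev[0] == "op":
            clock[p] += ev[1]
            pos[p] += 1
        elif ev[0] == "mem":
            _, bank, cycles = ev
            start = max(clock[p], bank_free[bank])  # wait if the bank is busy
            clock[p] = start + cycles
            bank_free[bank] = clock[p]
            pos[p] += 1
        else:                                       # "barrier"
            waiting.add(p)
    return clock

same_bank = [
    [("op", 4), ("mem", 0, 6), ("barrier",), ("op", 2)],
    [("op", 1), ("mem", 0, 6), ("barrier",), ("op", 2)],
]
finish = replay(same_bank, n_banks=2)
```

In this example the second processor reaches bank 0 first and the
first processor must wait for it, whatever order the two references
had in the level 1 run. Moving one of the references to a different
bank removes the conflict and shortens the run.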


III. The S-1 MkIIa Hardware Model

As the first application of our simulation 
techniques we chose to investigate the S-1 MkIIa
multiprocessor. This machine is of great interest 
due to its imminent availability and supercomputer 
performance potential. Additionally, its design 
explores for the first time the use of cache 
memory in high speed multiprocessors. 

In this section we describe the hardware
model we have incorporated in MPSIM-2 to model the
S-1 MkIIa multiprocessor. When we created the
model, the S-1 MkIIa implementation was far from
complete, and a number of details were uncertain.
This was particularly the case for the main
memory, crossbar switch, and microcode used for
interprocessor communication. Since hardware
documentation was generally unavailable we have
relied on conversation with S-1 project members
[Far82] to fill in the gaps where possible. How
successful we have been in modelling the S-1 MkIIa
as it is finally implemented will not be known
until the machine is available for testing. We
nonetheless are presenting the model and the
results from it in the hope that it will be a
generally useful contribution to performance
assessment of MIMD machines.

1. Summary of S-1 hardware

The S-1 MkIIa multiprocessor
[S-179,Far80,Far81] consists of up to 16
processors connected to a similar number of memory
banks by a crossbar switch. Each processor is
extensively pipelined and microcoded and possesses
the following major resources:

1. Instruction cache (4k words).
2. Data cache (16k words).
3. Address arithmetic unit.
4. Floating point adder.
5. Floating point multiplier.
6. 16 general purpose register files with 32
words each.
7. Local memory (1M word class).

The S-1 pipeline is partitioned into two
major units. The IBOX fetches and decodes 
instructions and handles all tasks associated with 
the fetching and storing of operands, including 
management of the cache, mapping of virtual to 
physical addresses, and memory protection. The 
ABOX is responsible for the remaining steps in 
instruction execution and calculates all results 
which will be stored in the register files or 
memory. 

The processors are implemented with ECL 10K 
and 100K integrated circuits, while the local and 
global memories are implemented with 64kb MOS 
chips. The design cycle time of the processor is 
50 ns for the instruction fetch and decode unit 
(IBOX) and 25 ns for the arithmetic unit (ABOX). 
The word size is 36 bits. 

The ABOX adder and multiplier are innovative 
in several respects [Far81]. They have been 
designed for low latency and are partitionable to 
allow calculation with operands of 18, 36, and 72 
bits. Special attention has been devoted to 
achieving high speed on FFT's and other 
mathematical functions. In part for this reason 
the adder produces both the sum and difference of 
its operands simultaneously. 

The instruction set is extensive, providing a 
wide variety of addressing modes, dyadic and
triadic vector operations, and evaluation of 
mathematical functions among its most notable 
features. 

Since the MPSIM-2 hardware model is driven by 
a trace stream obtained from a Cray-1, it is
appropriate to contrast the S-1 MkIIa design with
that of the Cray-1 [Cra76]. The most important
differences include the following: 

1. The designs of the memory hierarchies
differ greatly. The S-1 is a virtual address
machine which utilizes a combination of fast ECL
data and instruction caches, very large MOS main
memory, and disk for demand paging. The Cray-1 is
much simpler, relying on ECL technology for both
its registers and 16 interleaved banks of main
memory.

2. The S-1 has fewer functional units. All
instructions must pass through either the adder or
multiplier in the ABOX. On the Cray-1 there are
thirteen functional units (including memory) which
may execute instructions.

3. Vector operands on the S-1 are obtained
directly from memory, while on the Cray-1 they are
held in vector registers for use by the vector
functional units. The vector stride must be 1 for
the S-1, while it may be any value on the Cray.

4. The S-1 instruction set is more powerful
than that of the Cray, so that multiple Cray
instructions can often be replaced by a single S-1
instruction. The most common occurrence of this
is in operand address computation, but there are
many examples.

5. On the S-1, results are produced strictly
in the order implied by the instruction order. On
the Cray-1, although instructions are always
issued in order, results may be produced out of
order and by any of the functional units.

Both the number of instructions and the 
number of machine cycles devoted to address 
arithmetic are likely to differ greatly between 
the two machines. In favorable cases, however, 
the time taken to perform these operations is 
completely "hidden" through overlap with floating 
point operations and memory references. The same
situation holds for conditional branches. Both
machines are able to issue instructions at a
maximum rate of one per cycle (12.5 ns Cray-1,
50 ns S-1 MkIIa). How closely this goal is
approached is very sensitive to the optimization 
techniques employed by the compiler. 

As sketched above, the S-1 MkIIa processor is 
highly complex. Our simulation models a small 
carefully chosen subset of its features. In most 
cases the model assumes that omitted features 
(e.g., prediction of conditional branches) work 
perfectly, so that the simulation results form an 
upper bound on performance, but there are 
exceptions. 

In selecting the hardware features to be
included in the model we began by recognizing that
the effectiveness of the data cache will be a
major determinant of performance. Occurrence of a
data cache miss stops the IBOX pipeline until the
required data can be obtained. Since main memory
is much slower than the IBOX cycle time, data
cache misses can easily impose a limit on
performance. This is especially the case when
processors share data so that cache coherence
problems arise. The model must therefore be able
to determine the contents of the data cache of
each processor at every stage of the computation.
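
Tracking cache contents reference by reference can be sketched as
below. Only the 16k word capacity comes from the hardware summary
above. The direct-mapped organization, the 16 word line size, the
class name DataCache, and the invalidate-on-remote-write hook are
assumptions made for illustration and are not taken from the S-1
design.

```python
class DataCache:
    """Toy per-processor data cache (direct-mapped; an assumed organization)."""

    def __init__(self, size_words=16 * 1024, line_words=16):
        self.line_words = line_words
        self.n_lines = size_words // line_words
        self.tags = [None] * self.n_lines   # memory line held by each slot
        self.hits = 0
        self.refs = 0

    def _locate(self, addr):
        line = addr // self.line_words
        return line, line % self.n_lines

    def access(self, addr):
        """Reference word address addr; return True on a hit."""
        line, slot = self._locate(addr)
        self.refs += 1
        if self.tags[slot] == line:
            self.hits += 1
            return True
        self.tags[slot] = line              # miss: fetch the line, evict the old one
        return False

    def invalidate(self, addr):
        """Drop the line holding addr, e.g. after another processor writes it."""
        line, slot = self._locate(addr)
        if self.tags[slot] == line:
            self.tags[slot] = None

    def hit_ratio(self):
        return self.hits / self.refs if self.refs else 0.0

cache = DataCache()
for addr in range(64):      # unit-stride sweep over 64 words
    cache.access(addr)
```

With these assumed parameters a unit-stride sweep misses once per
16 word line, and a line invalidated by another processor's write
misses again on its next reference.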

On the other hand, we expect the 
effectiveness of the instruction cache to be 
generally high and of less importance in 
determining performance. This arises from both 
the generally much smaller size of total program 
instructions relative to cache size, and from the 
generally much higher localities observed for 
program instruction references compared to data 
references. This is most fortunate, since the 
instruction cache hit rate could not be modelled 
accurately without actual S-1 instruction streams
for the computation. These will not be usefully 
available until a vectorizing compiler for the 
machine is completed. 

Our model of the ABOX is quite simplified,
since it ignores delays which result from data
dependencies between operations. If all required
operands are in cache the simulated ABOX is ready
to execute a new instruction a fixed time
(T_issue) after beginning the previous
instruction. We have chosen the values of T_issue
to be the shortest possible, those obtained in the
absence of data dependencies. This is not a necessary
restriction on the model, since all data
dependencies are in fact available in the trace
stream. Again, our results are expected to form
an upper bound on performance.

Most of the complexity of the S-1 processor
arises from the need to keep arithmetic function
units at the end of a long complex pipeline
supplied with operands at a continuously high
rate. (See [Kog81] and [Lor72] for good
discussions of pipelined processors.) To achieve
this, a variety of techniques is employed,
including the partial decoupling of the IBOX and
ABOX with an operand queue, the prediction of
conditional branches, and the prediction of values
needed in address computations [Far81].

Our model assumes that in the absence of data 
cache misses, all of these techniques work 
perfectly, so that the functional units perform 
work at the maximum possible rate. Clearly this
may be seriously in error for some computations.
A Monte Carlo computation with a large number of
quasi-random conditional branches is likely to run 
more slowly than our model would predict, for 
example, due to reduced effectiveness of the 
hardware conditional branch prediction strategy. 

The model makes an additional simplifying 
assumption: the cost of address arithmetic 
instructions is ignored. As discussed above, this 
is equivalent to assuming that address arithmetic 
calculations are fully overlapped with other 
computations. Although the address arithmetic 
instructions themselves are filtered out, the 
memory reference instructions needed to fetch 
their operands are retained. This is necessary to 
include their effect on the data cache hit ratio 
(typically small). 

The effect of the simplifications we have 
discussed so far is exclusively in the direction 
of predicting performance which is too high. The 
performance results for the Monte Carlo algorithm 
discussed in Section IV probably show these 
effects to a significant degree. The model, 
however, contains additional assumptions which 
work in the other direction. The most important
of these is ignoring the chaining of vector
operations on the S-1. Chaining allows triadic
vector instructions to make simultaneous use of
the adder and multiplier in the ABOX. This can
increase the MFLOP rate by up to a factor of 2.
Measurements on the Cray-1, which has a more
general chaining capability, show the effect to be
somewhat less than this in most cases.

Of less importance is the fact that the model
does not reflect the ability of the S-1 to
calculate special functions (e.g., sin, log)
nearly as fast as a single multiplication.
Measurements on the Cray-1, which calculates such
functions slowly relative to multiplications
(120 ns vs 12 ns per result in vector mode), show
that even physics simulation codes which make
intensive use (by current standards) of special
functions very rarely spend more than 10% of their
time performing them. It is quite possible,
however, that algorithms will evolve which exploit
the S-1's ability to evaluate special functions
inexpensively.
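
The size of this omission can be bounded with Amdahl's law, which is
brought in here as an illustration and is not part of the original
argument. Assume a code spends a fraction f of its time in special
functions and that these alone become k times faster while everything
else is unchanged. The symbols f, k, and S are introduced only for
this estimate.

```latex
\[
  S(f,k) = \frac{1}{(1-f) + f/k}, \qquad
  S(0.10,\,10) = \frac{1}{0.91} \approx 1.10, \qquad
  \lim_{k\to\infty} S(0.10,\,k) = \frac{1}{0.90} \approx 1.11 .
\]
```

Taking k = 10 from the 120 ns vs 12 ns figures quoted above, a code
at the 10% limit would gain roughly 10% overall, and no value of k
could yield more than about 11%.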

IV. SIMPLE

SIMPLE [Cro78] is a widely distributed code 
which models the hydrodynamic and thermal behavior 
of fluids in two dimensions. The hydrodynamics is 
a standard Lagrangian formulation using an 
artificial viscosity. Heat transfer is performed 
in the diffusion approximation using a single ADI 
iteration on a five point implicit difference 
operator. Thermodynamic properties of the fluid
are obtained by table lookup and biquadratic
interpolation between table entries.

After an initialization phase the calculation
consists of a sequence of timesteps each of which 
advances the solution by a time increment DT in 
the following manner: 

1. Calculate the pressure in each zone given
the temperature and density.

2. Compute the acceleration, new velocity
and new position of each zone.

3. Compute new zone volumes.

4. Compute the artificial viscosity and
Courant timestep limit for each zone.

5. Calculate new zone internal energy after
hydrodynamic work. To maintain sufficient
accuracy the new energy is first predicted using
old thermodynamic quantities. The predicted
energy is then used to calculate more accurate
thermodynamic quantities which are used for the
final calculation of the new internal energy.

6. Calculate new temperature after
hydrodynamics from new density and internal energy.

7. Calculate heat diffusion coefficient for
each zone.

8. Calculate the coupling constants for the
column direction (one per zone).

9. Calculate an intermediate temperature by
solving a tridiagonal linear system which couples
zones in the same column and has the temperature
from step 7 as a right hand side.

10. Calculate the coupling constants for the
row direction (one per zone).

11. Calculate the new temperature by solving
a tridiagonal linear system which couples zones in
the same row and has the temperature from step 9
as a right hand side.

12. Calculate a heat diffusion DT for each
zone from the rate of change of its temperature.

13. Calculate new zone internal energy from
new temperature.

14. Calculate whole problem sums (kinetic
and internal energy and heat flow across problem
boundaries) and next DT by finding the minimum
over the entire mesh of the zonal Courant and heat
diffusion DT's.
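
Steps 9 and 11 each solve a tridiagonal linear system along one mesh
line. The text does not say which solver SIMPLE uses. The sketch
below assumes the standard Thomas algorithm (forward elimination
followed by back substitution) and is given only to make the data
dependence along a line concrete. The function name and argument
layout are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by the Thomas algorithm.

    lower[i] multiplies x[i-1] (lower[0] is unused),
    upper[i] multiplies x[i+1] (upper[-1] is unused).
    No pivoting is done, so a diagonally dominant matrix is assumed.
    """
    n = len(diag)
    c = [0.0] * n   # modified upper diagonal
    d = [0.0] * n   # modified right hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):                    # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):           # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# A 3 zone line whose exact solution is [1, 2, 3].
x = solve_tridiagonal([0.0, -1.0, -1.0], [2.0, 2.0, 2.0],
                      [-1.0, -1.0, 0.0], [0.0, 0.0, 4.0])
```

Both loops carry a recurrence from one zone to the next, so a single
line cannot be vectorized along its own length, although many
independent lines can be solved side by side.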

Although space does not permit a complete 
discussion of these calculational steps, some 
comments are in order. Steps 1 through 6 
constitute the hydrodynamics portion of the 
timestep. The method is explicit, and the new 
values for a zone depend only on the previous 
values of that zone and its six nearest 
neighbors. Steps 7 through 13 constitute the heat 
conduction portion of the timestep. The method is 
implicit, and the new temperature of a zone 
depends on the previous values of all zones in the 
mesh. This difference is quite important for a 
multiprocessor. It is also important to note that 
boundary zones require special treatment for both 
hydrodynamics and heat conduction and require more 
calculations than interior zones. 

We began with a version of SIMPLE which is 
almost completely vectorized by the CFT or CIVIC 
compilers for the Cray-1. This program was
modified for a multiprocessor in a straightforward 
manner. Each processor is assigned a fixed 
contiguous group of mesh columns for which it is 
responsible at each stage of the calculation. 
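
The assignment of contiguous column groups can be sketched as below.
The text does not say how columns are divided when the count is not
a multiple of the number of processors. Giving the first few
processors one extra column is an assumption of this example, as is
the function name.

```python
def column_groups(n_columns, n_procs):
    """Split columns 0..n_columns-1 into contiguous, near-equal groups.

    Returns one (first, last_exclusive) pair per processor.
    """
    base, extra = divmod(n_columns, n_procs)
    groups, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)   # spread the remainder
        groups.append((start, start + size))
        start += size
    return groups

groups = column_groups(90, 4)
```

For a 90 column mesh on four processors this gives groups of 23, 23,
22, and 22 columns.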

With the exception of the heat conduction row
sweep (discussed below) all calculations are
vectorized along columns. All arrays are stored
columnwise, so this results in vector operations
with unit stride, which is ideal for the S-1.
Synchronization barriers were placed between
calculation stages when required to ensure the
proper data dependencies. Fourteen such barriers
are required within the timestep loop. The loop also
contains a single critical section implemented
with semaphores which is used for updating scalars
which depend on the global mesh (step 14).
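
The combination of a barrier and a critical section can be sketched
with Python threads. This is an analogy only: the paper's code is
FORTRAN using the MPSIM-1 subroutines of Table 1, whose calling
sequences are not given here, so the sketch uses the Python standard
library (threading.Barrier and threading.Lock) instead, and the
function name and data are hypothetical.

```python
import threading

def timestep_min(local_dts_per_proc):
    """Each thread reduces its own zones, then updates a shared minimum."""
    n = len(local_dts_per_proc)
    barrier = threading.Barrier(n)
    lock = threading.Lock()            # plays the role of the semaphore
    shared = {"dt": float("inf")}      # scalar that depends on the whole mesh
    seen = [None] * n

    def proc(pid):
        local_min = min(local_dts_per_proc[pid])    # processor-private scalar
        with lock:                                  # critical section
            shared["dt"] = min(shared["dt"], local_min)
        barrier.wait()      # nobody reads dt until everyone has contributed
        seen[pid] = shared["dt"]

    threads = [threading.Thread(target=proc, args=(p,)) for p in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["dt"], seen

dt, seen = timestep_min([[0.3, 0.2], [0.5], [0.15, 0.4]])
```

Each thread reduces its own zones privately and enters the critical
section once, which keeps contention on the shared scalar low.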

With the exception of a single processor 
which is given the additional duty of performing 
output to the edit file, all processors execute 
identical code. In addition to the main data 
structures of the problem, which are in shared 
memory, each processor is supplied with local data 
structures which are stored in processor private 
memory. A few of these, such as loop indices, 
must be supplied to ensure logically correct 
operation. The majority of them, however, are 
made local to increase performance. These include 
the material property tables, arrays holding the 
coupling coefficients for heat conduction, and 
scratchpad arrays used for holding temporary 
results. Local scalars are used to hold each 
processor's contribution to global mesh quantities 
such as total kinetic energy and minimum timestep 
(step 14). 

As mentioned above, the heat conduction 
calculation contains an exception to the column 
group partitioning used in the remainder of the 
code. The heat conduction step in fact raises 
some interesting partitioning issues, and deserves 
special discussion. Since the value of a zone 
temperature after the heat conduction step depends 
on the previous temperatures of all mesh zones, it 
is inevitable that this calculation on a multi- 
processor will involve substantial interprocessor 
communication. There are at least two different 
ways of organizing the calculation. 

1. Straightforward partitioning. During the 
column sweep each processor is given a group of 
columns, while during the row sweep each processor 
is given a group of rows. All interprocessor 
communication is handled invisibly by the cache 
coherence algorithm in the hardware. 

2. Wavefront with row blocking. Each 
processor works with a group of columns for both 
the row and column sweep. During the row sweep 
processor i+1 cannot begin on its portion of a row
until processor i is finished with its portion of 
that row. 

The second method, which is advocated by
Gilbert [Gil79], attempts to reduce interprocessor
communication demands at the expense of reduced 
parallelism and increased program complexity. The 
idea is that each processor "owns" the data 
associated with a column group. During the row 
sweep a processor continues to work with this 
local data except at the boundaries of its column 
group, where interprocessor communication is 
required. For large column groups the relative 
cost of this communication becomes small. 
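
The loss of parallelism in the wavefront scheme can be estimated
with a simple timing model. The model is an assumption of this
sketch and is not taken from [Gil79]: every processor needs the same
time t_block for its portion of a row, communication is free, and
only one directional pass of the row solve is counted.

```python
def wavefront_finish_time(n_rows, n_procs, t_block=1):
    """Finish time of one wavefront pass over n_rows rows.

    Processor i may start its portion of row r only after processor
    i-1 has finished row r and it has itself finished row r-1.
    """
    finish = [[0] * n_rows for _ in range(n_procs)]
    for i in range(n_procs):
        for r in range(n_rows):
            left = finish[i - 1][r] if i > 0 else 0   # neighbour done with row r
            prev = finish[i][r - 1] if r > 0 else 0   # own previous row done
            finish[i][r] = max(left, prev) + t_block
    return finish[-1][-1]

t_wave = wavefront_finish_time(20, 4)
```

Under these assumptions the pass takes (R + P - 1) block times for R
rows and P processors, where perfectly parallel work would take R, so
the pass runs at R / (R + P - 1) of ideal efficiency.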

This strategy is effective only if the data
associated with a column group remains local to a
processor throughout the calculation. On the S-1
MkIIa there are two kinds of local data: that
which is contained in the data cache and that
which is contained in processor private memory
external to the cache. Data which is in cache
remains there only until it is removed by the
replacement algorithm to make way for other data.
This kind of locality is transient and for large
column groups is destroyed during the column
sweep. For the wavefront algorithm to work as
intended, therefore, the column group data must be
held in processor local memory. Processors
utilize the crossbar switch only to send data
packets directly to other processors.

The wavefront approach therefore is a form of 
"distributed computing" and as such requires quite 
different programming constructs than employed in 
uniprocessor scientific programs. In particular, 
the extended version of FORTRAN we are using has 
no convenient facilities to distinguish local from 
shared data or for the sending and receiving of 
interprocessor messages. The first approach,
however, allows the heat conduction part of SIMPLE 
to be programmed in the same style as the rest of 
the code. 

For the simulations reported here we have 
chosen to use the straightforward partitioning 
appropriate to a tightly coupled approach to 
multiprocessing. It is interesting to note that 
for large problems the overhead directly 
attributable to interprocessor communication still 
becomes small. It is of course true that each 
processor writes out many cache lines to main 
memory which are later read by other processors. 
The point is that the vast majority of these reads 
and writes are performed by the normal cache 
replacement algorithm, and would occur even in the 
absence of other processors. The penalty of the 
shared memory approach is then paid mainly in bank 
conflicts. 

The simulator output includes a large number 
of performance diagnostic quantities. In our 
discussion below, we have used the following 
definitions: The cache hit ratio is the ratio of 
the number of memory references originally finding 
a datum in cache divided by the total number of
memory references. Efficiency is the ratio of the 
time one processor needs for a calculation divided 
by P times the time it takes P processors to 
finish the same calculation. Traffic ratio is the 
ratio of the number of bits transferred between 
cache and other memory to the total number of bits 
which flow between cache memory and its CPU. 

Speed is measured in megaflops (MFLOPS), 
millions of floating point operations performed 
per second. Speedup is the ratio of the time it 
takes one processor to perform a calculation 
divided by the time it takes a specified number of 
processors to complete the calculation. 

It is important to notice that direct 
comparison of MFLOPS on a parallel machine versus 
a sequential machine may be misleading since many 
parallel algorithms perform significantly more 
arithmetic operations than their sequential 
analogs. For SIMPLE, however, such redundant 
operations are completely negligible. 

We have run problems varying in size from
20C,20R to 90C,40R (where C denotes number of
columns and R number of rows) and utilizing from 1
to 16 processors. Most of these have been run in
single precision mode, although a few double
precision runs have also been made. The results
we report here are for a single execution of steps
1-14 of the timestep advance. Runs with
multiple cycles have been performed, and show that
the transient effects from starting with an empty
cache are quite small. Table 2 summarizes our
results for a 90C,20R problem run in single
precision with varying numbers of processors.


Table 2. Simulated performance of SIMPLE.

 P    MFLOPS   Speedup   Efficiency
 1     9.04     1.00       1.00
 2    16.00     1.77       0.89
 4    26.45     2.93       0.73
 6    32.17     3.56       0.59
 8    34.7      3.84       0.48


The efficiency drops rapidly for P > 4 and 
there is clearly little to be gained by running 
this problem on more than eight processors. 
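
The speedup and efficiency definitions given earlier can be checked
against Table 2. The sketch assumes that the second column of the
table is the aggregate MFLOPS rate and that every run performs the
same arithmetic (the text notes that redundant operations are
negligible for SIMPLE), so that the ratio of run times equals the
inverse ratio of MFLOPS. The function name is hypothetical.

```python
# Rates for the 90C,20R single precision problem, copied from Table 2.
mflops = {1: 9.04, 2: 16.00, 4: 26.45, 6: 32.17, 8: 34.7}

def speedup_and_efficiency(rates):
    """Return {P: (speedup, efficiency)} from aggregate rates."""
    base = rates[1]
    out = {}
    for p, rate in sorted(rates.items()):
        s = rate / base          # equal work, so T1 / TP equals the rate ratio
        out[p] = (s, s / p)      # efficiency = T1 / (P * TP)
    return out

derived = speedup_and_efficiency(mflops)
```

The derived values agree with the tabulated speedups and
efficiencies to within rounding in the last printed digit.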

Figures 1 through 5 show detailed simulator 
results for a single run, a 90C,20R single 
precision problem run with four processors. As is 
evident, each phase of the calculation exhibits 
its own pattern of machine activity, so that the 
behavior during the timestep is quite complex. 
This pattern is recognizably similar for all 
multiprocessor SIMPLE runs we have performed, 
regardless of size or number of processors. 

Figure 1. Simulated performance of four processor
S-1 MkIIa in single precision mode on SIMPLE. The
problem size is 90C, 20R. The upper portion of the
figure shows the arrival times of the individual
processors at the algorithm's synchronization
points. All processors are restarted after the
synchronization at the points marked "X". The step
numbers refer to the calculational steps defined in
the text. The lower portion of the figure shows the
average per-processor megaflops vs time measured in
25 ns ABOX cycles.

Figure 2. Simulated performance of four processor
S-1 MkIIa in single precision mode on SIMPLE. The 
problem size is 90C, 20R. The figure plots the 
average fraction of time spent servicing data 
cache misses vs time in 25 ns ABOX cycles. 



Figure 3. Simulated performance of four processor
S-1 MkIIa in single precision mode on SIMPLE. The
problem size is 90C, 20R. The figure plots the
average global memory load as a fraction of total
available bandwidth vs time in 25 ns ABOX cycles.

Figure 4. Simulated performance of four processor
S-1 MkIIa in single precision mode on SIMPLE. The
problem size is 90C, 20R. The figure plots the
average local memory load as a fraction of total
available bandwidth vs time in 25 ns ABOX cycles.

Figure 5. Simulated performance of four processor
S-1 MkIIa in single precision mode on SIMPLE. The
problem size is 90C, 20R. The figure plots the
average fraction of time spent servicing
interprocessor cache line transfers vs time in
25 ns ABOX cycles.

It is interesting to compare the machine
activity during the heat conduction row sweep with
that of the column sweep. The row sweep shows high
cross traffic loads and takes about 2.75 times as
long as the column sweep. Both column and row
sweeps show extensive use of processor local
memory, mainly due to the local storage of
coupling coefficients. In the light of the
discussion above we expect this picture to change
substantially for sufficiently large problems,
which should show a less pronounced performance
difference between column and row sweeps.

We can easily calculate the problem size for 
which this transition should occur. For the 
90C,20R problem, each processor requires roughly 
5*90*20/4 = 2250 distinct operands from the point
it begins to access the shared temperature array 
during the column sweep until the column sweep is 
finished. This is substantially smaller than the 
cache size of 16384 words, so that each processor's 
entire column group is present in cache at the end 
of the column sweep. The subsequent row sweep 
then triggers the observed burst of cross traffic. 
For problems larger by a factor of roughly 8 
(16000 zones), however, this situation will change 
and each processor at the end of the column sweep 
will already have started writing back to shared 
memory the first temperature elements it accessed. 
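
The estimate above can be repeated for other mesh sizes and processor counts. The following sketch is a back-of-envelope check, not part of the simulator. It assumes that the factor of 5 operands per zone used in the text holds for every mesh size, and it ignores cache line granularity and replacement policy. The function names are ours.

```python
CACHE_WORDS = 16384      # cache size quoted in the text
OPERANDS_PER_ZONE = 5    # factor used in the text's estimate (assumed constant)


def operands_per_processor(columns, rows, processors):
    """Distinct operands each processor touches during the column sweep."""
    return OPERANDS_PER_ZONE * columns * rows / processors


def zones_filling_cache(processors):
    """Mesh size (in zones) at which one processor's operands fill its cache."""
    return CACHE_WORDS * processors / OPERANDS_PER_ZONE


if __name__ == "__main__":
    need = operands_per_processor(90, 20, 4)
    print("operands per processor:", need)
    print("cache size / operands: ", CACHE_WORDS / need)
    print("zones that fill cache: ", zones_filling_cache(4))
```

For four processors this simple estimate gives a growth factor of a little over 7 before the cache is filled, which is in line with the "roughly 8" quoted in the text.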

In spite of its dramatic appearance, the row 
sweep is not the only cause of the inefficiency 
shown in Table 2. Analysis of the simulator runs 
allows us to assign the inefficiency to three 
major causes, as shown below. 


                                        P = 4    P = 8
Waiting at synchronization barriers:     0.03     0.06
Global memory conflicts:                 0.15     0.25
Interprocessor line transfer:            0.06     0.12
Total:                                   0.24     0.43


It is perhaps more useful to express these 
same numbers as "lost CPU's" by multiplying the 
fractional performance loss by P. 


                                        P = 4    P = 8
Waiting at synchronization barriers:     0.12     0.48
Global memory conflicts:                 0.60     2.00
Interprocessor line transfer:            0.24     0.96
Total:                                   0.96     3.44
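
The conversion from fractional loss to "lost CPU's" is a multiplication by P, as the short sketch below shows. The numbers are copied from the first breakdown above; the dictionary layout and function names are ours and serve only as an illustration.

```python
# Fractional performance loss by cause, indexed by processor count P.
LOSS = {
    "synchronization barriers": {4: 0.03, 8: 0.06},
    "global memory conflicts": {4: 0.15, 8: 0.25},
    "interprocessor line transfer": {4: 0.06, 8: 0.12},
}


def lost_cpus(loss=LOSS):
    """Express each fractional loss as an equivalent number of idle CPUs."""
    return {cause: {p: frac * p for p, frac in by_p.items()}
            for cause, by_p in loss.items()}


def totals(table):
    """Sum a cause-by-P table over causes, giving one total per P."""
    out = {}
    for by_p in table.values():
        for p, value in by_p.items():
            out[p] = out.get(p, 0.0) + value
    return out


if __name__ == "__main__":
    for p, total in sorted(totals(lost_cpus()).items()):
        print(f"P={p}: {total:.2f} lost CPUs")
```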


We clearly must consider how this picture 
will change when the mesh size is greatly 
increased. As argued above, the fractional cost 
of both synchronization waiting and interprocessor 
line transfer should drop steadily with increasing 
mesh size, leaving global memory conflicts as the 
principal cost of multiprocessing. Simple models 
of multiprocessor memories [Yen82] predict that 
crossbar systems with equal numbers of processors 
and memories show an inefficiency due to conflicts 
that is nearly independent of P when P is greater 
than about 8. These facts taken together imply 
that efficiencies for sufficiently large problems 
should be fairly high (about 0.7) even for large 
numbers of processors. 

This is not the end of the story, however. 

We must note that high efficiency does not
necessarily imply high performance! It merely
means that performance continues to grow linearly
as processors are added. On a cache-based
machine, such as the S-1, the performance of each
uniprocessor drops as problem size is increased.
Table 3 shows the effect of varying the mesh size
for SIMPLE in the single processor case.


Table 3. One processor, varied problem size.

  C     R     C*R    MFLOPS    Hit ratio    Tfc ratio
 10    20     200     13.61       0.9956        0.078
 15    20     300     12.68       0.9952        0.087
 30    20     600     11.05       0.9943        0.108
 60    20    1200      9.89       0.9934        0.128
 90    20    1800      9.02       0.9925        0.145


One may expect that this decline will continue as 
problem size is further increased. The asymptotic 
value is difficult to predict without detailed 
analysis of the algorithm. The issue of 
performance scaling with problem size therefore 
becomes complex. As problem size increases 
performance tends to also increase, due to the 
decreasing relative cost of synchronization 
waiting and interprocessor communication. At the
same time, however, performance is negatively
affected by the decreasing effectiveness of cache 
as the data set size increases. 


V. Conclusion 

The simulations we have reported on were 
undertaken with the goals of investigating the 
performance issues raised by MIMD machines and 
gaining experience with the programming techniques 
required to utilize them. As yet we have explored 
only a limited set of algorithms and a single 
simulated machine. As discussed earlier, we have 
taken existing uniprocessor algorithms and 
extended them to a shared memory multiprocessor 
with minimal changes. Clearly future algorithms 
may depart radically from this approach. Our 
simulations are also deficient in that real 
problems run on fast multiprocessors will in 
general have many more zones than those we have 
been able to treat. 

In view of all these limitations, what have 
we actually learned? Our experience to date with 
the simulator allows us to draw three tentative 
conclusions about the use of MIMD machines for 
solving large scientific problems: 

1. The S-1 (and other similar machines) can 
be used as a multiprocessor in two relatively 
distinct modes. These are a shared memory, or 
tightly coupled, approach in which problem data is 
primarily stored in shared memory; and a 
distributed processing, or loosely coupled, 
approach in which problem data is primarily stored 
in private memory and communication takes place 
when required through the interprocessor message 
network. The shared memory approach, which we 
have used here, is relatively simple to program 
using a few extensions to FORTRAN. The 
distributed computing approach appears to offer 
higher performance for many algorithms. However, 
substantial programming effort and significant 
language extensions would be required to realize 
this potential. 

2. Extrapolation of our simulator results to 
much larger problems indicates that many of the 
factors which limited our speedups in the 4 < P 
< 16 range will be greatly reduced in importance. 
This is particularly true for: 

a. Overhead operations which result from
partitioning the algorithm. These include process
management operations and communication between
processes.

b. Synchronization penalty that results
from speed variation between processes.

On the other hand, conflicts between processors in
obtaining access to shared resources (usually 
memory banks and communication paths) will 
continue to be important. In considering the 
scaling issue, the effect of cache memory may 
outweigh all of these factors, however, which 
brings us to our final point. 

3. The usefulness of a traditional data 
cache for scientific problems with large data sets 
appears questionable. In the cases we have 
studied, performance drops rapidly with increasing 
problem size. Clearly this performance drop can 
be delayed with algorithms which optimize cache 
hit rates. It does not appear feasible to do this 
by hand, however, except for simple cases. 
Compilers clearly must perform this task if it is 
to be done at all. The problem is difficult since 
the optimization approach must be dependent on 
data set size. 

In the future we plan to further increase our 
understanding of MIMD machines and algorithms by 
pursuing the comparison of our simulator results 
with actual measurements on the S-1 MkIIa 
multiprocessor, when it becomes available. 
Additionally, we feel that improved analytic 
models for MIMD performance can be constructed 
which will be quite useful. The work of Briggs
[Bri80], Dubois [Dub82a,b], Yen et al. [Yen82], and
Gilbert [Gil79], among others, provides an
excellent foundation on which to build them.


Abstract


This paper establishes a simulation model in which vectorizable 
discrete event models can be defined. It gives an algorithm for 
exposing the vectorizable structure in a given model. It applies this 
algorithm to a Monte Carlo particle transport problem known to be 
vectorizable and demonstrates why this problem is vectorizable. A 
few implications of vectorization for definition of models to be 
simulated and for implementation of simulation systems are given. 


*Avinash Chandak is currently employed at the World Bank in 
Washington, D.C. 


The Problem of Vectorization of Discrete Event Simulations


It is an item of folklore in computer science that discrete event
simulations cannot be effectively vectorized because of the random
nature of event generation. This supposed limitation is a non-trivial
problem since the utility of simulations may depend on levels
of accuracy which are obtainable only by very long runs on the
fastest of today’s scalar computers. This paper establishes the
conditions under which discrete event simulation can be vectorized.
The use of the framework which is given here may lead to the
selection of model representations which are more vectorizable than
equivalent alternative representations.


Application of the procedure is illustrated by consideration of the
structure of a model for a Monte Carlo particle transport process.
Brown, Callahan and Martin [2] showed by analysis of the actual
code that it is almost entirely vectorizable. The analysis given
herein shows why this is the case.


The conditions for vectorizability are very similar to those 
established by Georgiadis, Papazoglou and Maritsas [3] for MIMD 
parallel structuring of SIMULA programs. 


There are three levels of approach to vectorization. The first is 
to look at existing code to determine what can be vectorized. The 
second level is to look for algorithms which can be effectively 
vectorized. The third level is for structural formulations of a 
problem directly in terms of vectors. This study began at level 2 
and, in particular, examined vectorization of time axis or event list 
processing. This portion of the project was successful in the sense 
that a vector formulation of time axis management produced a 
major (factor of two) improvement in run times for cases where 
time axis management was a major factor in total run time [1]. 
This factor of two was actually demonstrated in evaluations of
model systems defined in the simulation system described below,
run on a CDC CYBER 205. Further improvement at the algorithm
level seemed unlikely so we turned attention to the basic structure 
of discrete event problems as expressed in the rather simple 
simulation modeling system developed for the level 2 studies. This 
approach leads to the results given herein. 


Definition of the Simulation Modeling System 


The modeling capability of this simulation system consists of the 
following major constructs and some system provided routines. 


1. Transactions. These are entities which consist of data only. 
They may have some (or no) user defined parameters. Several 
different transaction types may be defined. The system maintains 
space for transactions. 


2. Transaction Generators. These are user defined routines
which generate transactions. A transaction generator has a
transaction type associated with it.


3. Facilities. These are user defined routines which service the
transactions. They manipulate the user defined transaction
parameters. Each facility services some queue(s). They specify the
service time needed by using the system routine SIMWAIT.


4. Queues. These are used to hold transactions which are 
waiting to get service at a facility. A queue is serviced by one 
facility, or a group of facilities. 


5. Common Variables. These are variables used by transaction 
generators and facilities to hold data between activations and to 
pass information between entities in the model. 


The system maintains some status on transactions by type, on 
facilities and on queues. These are available to the user at the 
completion of the simulation. 


The system routines whose use affects vectorizability are 
described following. 


1. The "SEND" routine. This is used by facilities and 
transaction generators to insert transactions into queues. Multiple 
executions will result in multiple copies of the transaction being 
created. 


2. The "COLLECT" routine. This is used to reduce the number 
of copies of the transaction by one, or if there is only one copy it 
has the same effect as the "SEND" routine. This routine functions 
by searching all of the storage area for a given transaction type. 


3. The "START" routine. This is used by transaction generators
and facilities to enable (activate) transaction generators and
facilities. If the affected routine is already active, it has no effect.


4. The "STOP" routine. This is the inverse of the "START"
routine.


The following two constructs limit vectorizability when used in 
the definition of a model system to be simulated. 


The first is any routine (facility or transaction generator) which
"SEND"s a transaction to the front of a queue. Only those queueing
disciplines which yield fixed ordering in queues lead to
vectorizable code.


The second concerns "COLLECT". We need the restriction that the
"COLLECT" routine only searches the queues served by the facility
doing the "COLLECT". If this restriction is not obeyed then all
facilities servicing transaction types which have some facility in
the simulation doing a "COLLECT" on transactions of that type lose
vectorizability.


It should be pointed out that some of the operations in the 
simulation model are not vectorizable. At the simulation system 
level the non-vectorizable operations include those that maintain 
statistics for the various queues. This is because the statistics for 
the (n)th transaction are a function of the numbers for the (n-1)th 
transaction. The pipeline used for vector computations accepts a 
new set of operands before the results from the previous set are 
available. Therefore any operation which uses the (n-1)th result to 
determine the (n)th result is not vectorizable. An example is the 
waiting time in a queue. It is determined by the departure and 
service times of the previous transaction. There are other such 
instances and these must be handled in scalar mode. 
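

The dependence of the (n)th result on the (n-1)th result can be seen in a small sketch of the waiting time calculation. The code below is an illustration under our own assumptions, namely a single-server, first-in first-out queue with known arrival and service times; it is not taken from the simulation system, and the names are hypothetical. Each pass through the loop needs the departure time produced by the previous pass, which is exactly the pattern a vector pipeline cannot overlap.

```python
def queue_waits(arrivals, services):
    """Waiting time of each transaction in a single-server FIFO queue.

    arrivals -- arrival times, in non-decreasing order
    services -- service time required by each transaction
    """
    waits = []
    prev_departure = 0.0
    for arrive, service in zip(arrivals, services):
        # Service cannot start before the previous transaction departs:
        # this is the loop-carried dependence on the (n-1)th result.
        start = max(arrive, prev_departure)
        waits.append(start - arrive)
        prev_departure = start + service
    return waits


if __name__ == "__main__":
    print(queue_waits([0, 1, 2, 10], [3, 3, 3, 1]))
```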


Analysis for Vectorizability: A Graph Algorithm 


The algorithm for determining the vectorizability for simulation 
evaluation of a given model is given following. It results in the 
construction of a graph. The input is the user description of the 
simulation. 


1. Draw a node for each of the queues, transaction generators,
facilities and common variables and mark each node "facility",
"queue", "transaction generator" or "common variable" as
appropriate.


2. For a set of facilities in a facility group, draw an arc from a 
facility which has no outgoing arc to another which has no 
incoming arc. This should result in a cycle connecting all the 
facilities in a facility group. 


3. For all facilities and transaction generators, draw an arc to 
each queue to which it "SEND"s transactions. 


4. For all queues, draw an arc to each facility that services the 
queue. 


5. For each common variable, draw an arc to a facility or 
transaction generator if it is used to: (i) change a local variable, (ii) 
change a different common variable, (iii) change a transaction 
variable, (iv) affect the flow of control inside the facility or 
transaction generator (e.g., in an IF statement), or (v) affect the 
flow of a transaction inside the simulation (e.g., in a "SEND"). 


6. For each facility and transaction generator, draw an arc to 
each common variable whose value is changed at some point inside 
the facility or transaction generator. 


7. For every "START" or "STOP" operation, create a new node.
Draw an arc from the facility or transaction generator doing the
"START" or "STOP" to the new node and from the new node to
the facility or transaction generator being "START"ed or
"STOP"ped. Mark the new node with a "START" or "STOP".


8. For every facility which specifies a service time of zero, merge
the nodes for the queues serving the facility into the facility.
Delete any arcs forming a self loop for such facilities.


9. For every facility which does a "COLLECT", create a new 
node, mark it "COLLECT", draw an arc from the facility to the 
new node and draw an arc from the new node to the facility. 


The algorithm given above can be implemented as part of the 
preprocessor described in [1]. The conditions precluding
vectorizability are: 


1. Any node (representing a facility or transaction generator) 
with more than one arc entering it cannot be vectorized. 


2. If there exists a cycle in the graph then the operation of any 
node in that cycle cannot be vectorized. 


3. For any path in the graph containing a "START" or "STOP"
node, any node in that path cannot be vectorized.


The arcs in the graph represent information flow in the 
simulation. Information from various sources can interact at the 
nodes representing facilities and transaction generators. This 
means that a node with multiple arcs entering it must synchronize 
the information from the different sources, and, in general, cannot 
operate in the vector mode. It may be possible to vectorize part of 
the operation at that node, but it cannot be totally vectorized. A 
cycle means that all the nodes in the cycle must be synchronized, 
and none of them can be vectorized. A path containing a
"START" or "STOP" node means that both the routine doing the
"START" or "STOP" and the routine being affected must be
synchronized and that both must know the correct simulation time.
In a vectorized simulation, various routines can be operating at 
different simulation times, as long as their operations do not 
interact. 
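

The three conditions can be checked mechanically once the graph has been built. The sketch below is one possible reading of them, written for illustration; it is not the preprocessor of [1]. The node kinds, the function name and the example graph are our assumptions. In particular, condition 3 is interpreted here as "every node that can reach, or be reached from, a START or STOP node", and the example graph is a hypothetical one in which a generator and a common variable both feed a single facility.

```python
from collections import defaultdict


def non_vectorizable(kinds, arcs):
    """Return the set of nodes ruled out by the three conditions.

    kinds -- dict mapping node name to its kind, e.g. "facility",
             "generator", "queue", "common variable", "START", "STOP"
    arcs  -- iterable of (source, destination) pairs
    """
    succ, pred = defaultdict(set), defaultdict(set)
    for src, dst in arcs:
        succ[src].add(dst)
        pred[dst].add(src)

    def reach(start, neighbours):
        """All nodes reachable from start by following neighbours."""
        seen, stack = set(), [start]
        while stack:
            for nxt in neighbours[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    routines = {n for n, k in kinds.items() if k in ("facility", "generator")}
    bad = set()
    # Condition 1: a routine node with more than one incoming arc.
    bad |= {n for n in routines if len(pred[n]) > 1}
    # Condition 2: a node lies on a cycle if it can reach itself.
    bad |= {n for n in kinds if n in reach(n, succ)}
    # Condition 3: nodes on any path through a START or STOP node.
    for node, kind in kinds.items():
        if kind in ("START", "STOP"):
            bad |= {node} | reach(node, succ) | reach(node, pred)
    return bad


if __name__ == "__main__":
    kinds = {"G": "generator", "F": "facility", "ENV": "common variable"}
    print(non_vectorizable(kinds, [("G", "F"), ("ENV", "F")]))
```

In the hypothetical example the facility F has two incoming arcs and is therefore reported, while a graph with the single arc from G to F passes all three conditions.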


It is possible that the graph is not connected. This means that 
the information and flow represented by the different pieces of the 
graph do not interact at all, and can be implemented as different 
problems. 


The original model represented the simulation as a set of
operations on single transactions. The default unit of information
in the vectorized model is a vector of transactions. All operations
are carried out on vectors. Where needed, operations can be
carried out on a single entry in a vector.


Each common variable will be represented as two vectors, one
containing the value and another containing the corresponding
simulation time. If we have the situation shown in Figure 1, then a
scalar procedure is needed to convert the vector, because the length
and times of the vector produced by the facility writing the
common variable need not be the same as the length and times of
the vector needed by the facility reading the common variable.


[Diagram omitted; the only labelled node is "common variable".]


Figure 1: A Structure Requiring Scalar Conversions


The unlabelled nodes are facilities or transaction generators. 
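

A scalar conversion of the kind required in Figure 1 might look like the sketch below. It is an illustration under our own assumptions: the reader is taken to see the most recent value written at or before each of its own simulation times, and a value supplied at initialization is used before the first write. The function name and arguments are hypothetical.

```python
from bisect import bisect_right


def resample(values, times, read_times, initial):
    """Convert a written (values, times) pair to the reader's time vector.

    values, times -- what the writing facility produced; times ascending
    read_times    -- simulation times at which the reader needs a value
    initial       -- value in force before the first write
    """
    out = []
    for t in read_times:
        i = bisect_right(times, t)  # number of writes at or before time t
        out.append(values[i - 1] if i else initial)
    return out


if __name__ == "__main__":
    print(resample([10, 20, 30], [1.0, 4.0, 6.0], [0.5, 1.0, 3.9, 7.0], 0))
```

The result has the length of the reader's vector, not the writer's, which is why the conversion cannot simply reuse the vector as written.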


For the situation shown in Figure 2, a scalar procedure is needed
to merge the transaction vectors into one vector. The resulting
vector is then passed to the facility for processing.


Figure 2: Another Structure Requiring Scalar Conversions
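

The merge needed in Figure 2 can be sketched as follows. We assume, for illustration only, that each transaction carries its simulation time as the first element of a tuple, that every incoming vector is already ordered by time, and that the merged vector should also be ordered by time. The function name is ours.

```python
from heapq import merge


def merge_transaction_vectors(*vectors):
    """Merge time-ordered vectors of (time, payload) transactions into one."""
    return list(merge(*vectors, key=lambda transaction: transaction[0]))


if __name__ == "__main__":
    a = [(1.0, "a1"), (4.0, "a2")]
    b = [(2.0, "b1"), (3.0, "b2"), (5.0, "b3")]
    print(merge_transaction_vectors(a, b))
```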


The algorithm for vectorization actually points out situations in 
the simulation which cannot be vectorized. It assumes that 
vectorization is possible unless something prevents it. 


The amount of vectorization strongly depends on the model 
representation of the system being simulated. An examination of 
the graph might suggest changes to the simulation that enhance 
vectorizability. In general, real time in the problem being
simulated is mapped onto simulation time. If this leads to a 
complex model, then a simulation where real time is not mapped 
onto simulation time should be considered. This alternative is 
possible when the problem does not have queueing and competition 
for resources. 


Monte Carlo Particle Transport: 
A Completely Vectorizable Problem 


Monte Carlo solution of particle transport problems [5] can be 
cast as a discrete event simulation problem. Brown, Callahan and 
Martin [2] showed from examination of an actual code that it could 
be almost completely vectorized. This section applies the
algorithm developed in the last section to this problem to illustrate 
analysis for vectorizability. 


If real time is mapped onto simulation time for the particle 
transport problem, then we have a very complex simulation. We 
represent particles as transactions and run the simulation so that 
the particles undergo collisions, absorption or splitting at different 
times. Because simulation time and real time correspond, a queue 
in which insertion at an arbitrary point is possible is needed. This 
is done in the simulation model of [1] by means of a number of 
queues of different priority classes. This scheme would require a 
very large number of queues.


There is no competition for resources in this problem and hence 
there is no need for queues in the simulation. Consider a 
simulation model in which a transaction generator produces 
particles at random simulation times. In this model real time is not 
mapped onto simulation time but is represented as a transaction 
parameter. A facility processes each particle through all its events 
until it and all particles it creates (by splitting) are absorbed. It 
does this in zero simulation time. The particles are represented by 
x, y and z coordinates and x, y and z velocity components and 
time. The environment is represented by a set of constant global 
variables. The algorithm is applied to this model and the resulting
graph is shown at each step. The resulting graphs of Figure 3 show
that this form of the particle transport problem can be vectorized.


Figure 3.a: The Graph after Step 1
Figure 3.b: The Graph after Step 3
Figure 3.c: The Graph after Step 4
Figure 3.d: The Graph after Step 5
Figure 3.e: The Graph after Step 6
Figure 3.f: The Graph after Step 7
Figure 3.g: The Graph after Step 8


The resulting graph indicates that the problem is not totally 
vectorizable. It needs a scalar routine to convert the common 
variable into a vector of the appropriate length. Note, however, 
that the common variable node has no incoming arcs. If a common 
variable has no incoming arcs, it means that variable is set up 
during initialization and never changed after that. It can be 
treated as a constant. 


This means that in the graph we can delete all arcs going out 
from common variable nodes which have no incoming arcs. 


The graph resulting after that transformation is given below, and 
it indicates that the particle transport problem is totally 
vectorizable. 
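

The transformation just described, deleting the outgoing arcs of common variables that are never written, is easily expressed on an arc list. The sketch below is illustrative; the node names and the small example graph are hypothetical stand-ins, since the graphs of Figures 3 and 4 are not reproduced here.

```python
def drop_constant_arcs(kinds, arcs):
    """Remove arcs leaving common variables that have no incoming arcs."""
    arcs = list(arcs)
    written = {dst for _, dst in arcs}
    constants = {node for node, kind in kinds.items()
                 if kind == "common variable" and node not in written}
    return [(src, dst) for src, dst in arcs if src not in constants]


def in_degree(node, arcs):
    """Number of arcs entering node."""
    return sum(1 for _, dst in arcs if dst == node)


if __name__ == "__main__":
    kinds = {"G": "generator", "F": "facility", "ENV": "common variable"}
    arcs = [("G", "F"), ("ENV", "F")]
    pruned = drop_constant_arcs(kinds, arcs)
    print(in_degree("F", arcs), "->", in_degree("F", pruned))
```

In this hypothetical graph the facility has two incoming arcs before the transformation and one afterwards, so it no longer violates the first condition.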


Figure 4: The Graph Structure
for the Particle Transport Problem


Implementation Considerations 


A simulation program structured to take advantage of 
vectorizability present in a given model may look very different 
from a conventionally structured simulation program. Each 
transaction generator action will produce a vector of transactions. 
Each activation of a facility will process a vector of transaction 
steps and each SEND execution will move vectors of transactions 
between facilities and queues or facilities and facilities. In addition 
to the speed-up which will derive directly from vectorization, there 
will also be a major saving in execution time from the deletion of 
the large number of subroutine calls normally required to generate 
and process a transaction. There will be only a single subroutine 
call to process an entire vector of transaction steps instead of a call 
per transaction step. 
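

The difference in calling pattern can be illustrated with a toy facility. The sketch below is not drawn from the simulation system; the "advance a position by velocity times a time step" operation is a hypothetical stand-in for whatever a real facility would do, and plain Python lists stand in for hardware vectors. The point is only that the vector form is entered once per vector of transaction steps rather than once per step.

```python
def service_scalar(x, v, dt):
    """Conventional style: called once for every transaction step."""
    return x + v * dt


def service_vector(xs, vs, dts):
    """Vector style: one call processes a whole vector of transaction steps."""
    return [x + v * dt for x, v, dt in zip(xs, vs, dts)]


if __name__ == "__main__":
    xs, vs, dts = [0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [0.5, 0.5, 0.5]
    per_step = [service_scalar(x, v, dt) for x, v, dt in zip(xs, vs, dts)]
    per_vector = service_vector(xs, vs, dts)
    print(per_step == per_vector)
```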


There will, however, be an additional storage cost. The vectors 
of transactions will have to have storage at each queue and facility 
or else sharing will have to be designed into the code structures. 


ABSTRACT


Although backward error recovery with recovery blocks
(RB's) has received considerable attention from many
researchers, no attempt has been made to structure its
implementation alternatives and then to evaluate/analyze
their effectiveness. In this paper we categorize three
different methods of implementing RB's. These are the
asynchronous, synchronous, and the pseudo recovery point (PRP)
implementations. We have developed probabilistic models
for estimating (i) the interval between two successive
recovery lines for asynchronous RB's, (ii) mean loss in
computation power for the synchronous method, and (iii)
additional overhead and rollback distance in case PRP's are
used.


1. INTRODUCTION 


The best known technique of backward error recovery,
the recovery block (RB), was proposed by Horning [1] and
Randell [2]. It is a sequential program structure that
consists of an acceptance test (AT), a recovery point (RP), and
alternative processes for a given process. In case an error
is detected or the AT fails, the process rolls back to an old
state saved at the previous RP and executes one of the
other alternatives. Unfortunately, for cooperating
concurrent processes the rollback of a process may cause
other processes to roll back (this phenomenon is called
rollback propagation) because of process interactions and
imperfect checking of global correctness. This rollback
propagation continues until it reaches a recovery line [3] at
which a globally consistent state does exist. In the worst
case, an avalanche of rollback propagation (called the
domino effect) can push the processes back to their
beginnings. The interval between the restart point and the time
point at which an error is detected, called the rollback
distance, can be used to represent the computation loss in
rollback recovery.
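

The structure of a single recovery block can be sketched in a few lines. The code below is an illustration only and covers one sequential process; it does not model interactions, rollback propagation or recovery lines. The state is assumed to be a dictionary, the alternates are assumed to modify it in place, and all names are hypothetical.

```python
import copy


class RecoveryBlockFailure(Exception):
    """Raised when every alternate has been tried and rejected."""


def recovery_block(state, alternates, acceptance_test):
    """Run alternates in order until one passes the acceptance test (AT)."""
    recovery_point = copy.deepcopy(state)  # RP: state saved on entry
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):
                return state
        except Exception:
            pass  # a detected error is handled like a failed AT
        # Roll back to the state saved at the recovery point.
        state.clear()
        state.update(copy.deepcopy(recovery_point))
    raise RecoveryBlockFailure("all alternates rejected")


if __name__ == "__main__":
    def faulty(s):
        s["x"] = -1

    def correct(s):
        s["x"] = s["x"] + 1

    print(recovery_block({"x": 1}, [faulty, correct], lambda s: s["x"] > 0))
```

In the usage shown, the first alternate leaves the state in a condition that fails the test, the state is restored from the recovery point, and the second alternate is then accepted.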


The domino effect and rollback propagation are the 
major obstacles in implementing the recovery block scheme 
for concurrent processes. Furthermore, decision on rollback 
propagation and determination of recovery lines will become 
more complex though they can be made in a centralized [4] 
or decentralized manner [5,6].


Several refinements have been proposed to overcome 
the drawbacks in this recovery block scheme. One approach 
is to put concurrent processes into a controlled scope, i.e., 
to synchronize the occurrence of acceptance tests. Randell 
[2] has suggested the conversation scheme which
requests every cooperating concurrent process to leave its
acceptance test at the same moment (called test line).


The work reported here is supported in part by National Aeronautics 
and Space Administration Grant No. NAG 1-296. Any opinions, findings, and 
conclusions or recommendations in this publication are those of the authors
and do not necessarily reflect the view of NASA. 

Other mechanizations of the conversation scheme on the
basis of the same concept but with more flexibility have
been devised by Kim [7]. Synchronized rollback recovery
schemes for transactions using a two-phase commitment
protocol or transaction ordering are also studied in [8,9].
Another approach is to save additional states based on the
occurrence of interactions; for example, the branch
recovery points [10] and the system defined checkpoints
(SDCP) [11].


In this paper we propose to employ pseudo
recovery points¹ (PRP's) to alleviate the rollback
propagation problem by allowing a process to restart at a PRP in
case the process is forced to roll back by others as a
result of rollback propagation. Therefore, we can classify
these refinements into two categories, synchronized
recovery blocks and pseudo recovery points, providing a
contrast with the third category called asynchronous
recovery blocks. To implement the rollback recovery
schemes, we have to consider various trade-offs between
these three categories and the characteristics of
concurrent processes. It is necessary to perform quantitative
analyses for estimating the mean amount of computation
undone in case processes roll back, the optimal interval
between two successive synchronizations, the mean size of
memory space required to save states, etc.


In the following section, several assumptions are dis- 
cussed and then a model for asynchronous recovery blocks 
is introduced. Using this model, we employ simulations to 
present the probability distribution of the interval between 
two successive recovery lines. In Sections 3 and 4, the
synchronization method and the implantation of pseudo
recovery points are evaluated respectively. The paper con- 
cludes with Section 5. 


2. EVALUATION OF ASYNCHRONOUS RECOVERY BLOCKS 


Let us consider the history diagram in Figure 1 to
illustrate the activities of cooperating concurrent processes $P_i$,
$i = 1, 2, \ldots, m$. Let set $A \subseteq \{1, \ldots, m\}$, i.e., a subset
of the indices of concurrent processes, and let $RP_i^j$ be the $j$-th
recovery point of $P_i$. Then one may find a combination of $RP_i^j$ for all
$i \in A$, which forms a recovery line for set $A$, denoted as $RL_r^A$
for the $r$-th recovery line. For simplicity superscripts in
representing recovery lines will be omitted in the sequel as
long as that does not result in ambiguity. The interval
between two successive recovery lines $RL_r$ and $RL_{r+1}$ in
process $P_i$, $i \in A$, is a random variable and denoted by $X_r^i$.
Since a recovery line provides globally consistent states to
all members of process set $A$, it is reasonable to assume
that $X_r^i$ is stochastically identical for all $i \in A$. Thus, $X_r$ is
used to represent the interval between the $r$-th and
$(r+1)$-th recovery lines.


¹ We call it a pseudo recovery point (PRP) since there is no
acceptance test before the saving of process state at a PRP. The states recorded
at PRP's may have been contaminated and thus cannot be used to recover a
failed process.


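[Figure 1 diagram: time lines of processes $P_1$, $P_2$ and $P_3$
along a time axis, showing recovery points, the recovery lines
$RL_r$ and $RL_{r+1}$, the intervals $X_r^i$ between them,
interactions between processes, and the failure of a process at
an acceptance test.]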


Figure 1. A History Diagram of Occurrence of 
Interactions and Recovery Points 


2.1. Modeling Assumptions 


We make the following assumptions in our subsequent 
analyses. 


1. Autonomous Processes: Cooperative autonomy is
regarded as the most important requirement in distributed
processing. Each process should be executed according
to its own program and environment, almost as if there
were no processes to interfere with. Thus, processes
will transmit messages or establish their recovery points
independently of other processes.


2. Perfect Acceptance Test: Acceptance tests should
detect all errors within the local process during the
execution of recovery blocks and thus ensure the
correctness of local execution. At least, the computation
results that have passed the acceptance test should be
"acceptable" [3]. However, the local acceptance test
may or may not detect external errors or erroneous
messages since a local process is not aware of the global
system and other processes.


3. Probability Distribution of Interactions: Usually,
process behavior is modeled as an ordered sequence which
in turn is specified by the program and dependent on
execution conditions. Even if the processing sequence is
given, the interval between two successive interactions
is variable due to conditional branches. Locking and
waiting at shared resources make it even more uncertain.
Nonetheless, for both tractability and simplicity we have
adopted here constant reference rates in the
multiprocessor and exponentially distributed intervals between
two successive message transmissions in the computer
network. The interval for two successive interactions
between $P_i$ and $P_j$ is thus assumed to be exponentially
distributed with mean $1/\lambda_{ij}$, and $\lambda_{ij}=\lambda_{ji}$ for all
$i,j \in A$ and $i \neq j$.


4. Consistent Communications: Let two messages $m_1$ and
$m_2$ be sent from $P_i$ to $P_j$. Consistent communications
should satisfy: (i) every message sent from $P_i$ to $P_j$ will
be received eventually by $P_j$, and (ii) $m_1$ and $m_2$ are
received by $P_j$ in the same logical order as that in which
they are sent.


5. Distribution of Recovery Points: Because of process
independence and the uncertainty of execution
conditions, the appearances of recovery points are
random and difficult to model. To avoid complexity,
establishment of recovery points in a process is assumed to be
an independent Poisson process with parameter $\mu_i$ for
process $P_i$.


2.2. A Model for Asynchronous Recovery Blocks


Since individual recovery points by themselves may not
be sufficient in rollback recovery, we consider only the
formation of recovery lines for asynchronous recovery blocks.
The requirements of a recovery line for processes $P_i$ for
$i=1,2,\ldots,n$ can be stated as follows:


1. Each recovery line has to include one recovery point
$RP_i^j$ for every process $P_i$.

2. Let the moment of establishment of the $j$-th recovery
point in process $P_i$ be $t(RP_i^j)$ and let $t_q^{ij}$ be the
moment of the $q$-th interaction from $P_i$ to $P_j$. For every
pair $(RP_i^j, RP_{i'}^{j'})$ in a recovery line, there does not
exist an integer $k$ such that $t_k^{ii'} \in [t(RP_i^j), t(RP_{i'}^{j'})]$ if
$t(RP_i^j) < t(RP_{i'}^{j'})$ ($t_k^{ii'} \in [t(RP_{i'}^{j'}), t(RP_i^j)]$ otherwise).
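The two requirements above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical data layout that is not part of the original model description: `rp_times` maps each process to the establishment time of the recovery point chosen for it, and `interactions` is a list of `(time, sender, receiver)` tuples. Requirement 1 is met by construction when every process has an entry in `rp_times`.

```python
from itertools import combinations


def forms_recovery_line(rp_times, interactions):
    """Return True if the chosen recovery points form a recovery line.

    rp_times     -- dict: process id -> time t(RP) of the recovery point
                    chosen for that process (hypothetical layout)
    interactions -- iterable of (time, sender, receiver) tuples
    """
    for a, b in combinations(rp_times, 2):
        lo, hi = sorted((rp_times[a], rp_times[b]))
        for t, src, dst in interactions:
            # Requirement 2: no interaction between the two processes may
            # fall inside the closed interval spanned by their recovery points.
            if {src, dst} == {a, b} and lo <= t <= hi:
                return False
    return True
```

For example, with recovery points at times 2.0, 5.0 and 4.0 in three processes, an interaction between the first two processes at time 3.0 prevents the combination from being a recovery line, whereas an interaction at time 6.0 does not.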


The basic idea underlying the model is to trace the
occurrence of both recovery points and interactions. Based
on the assumptions in Section 2.1, random variable $X_r$ can
be modeled by a continuous-time Markov process starting
from a recovery line ($RL_r$) and ending at the next recovery
line ($RL_{r+1}$). For a set of processes, $\Omega_A = \{P_i \mid i=1,2,\ldots,n\}$,
two types of states are defined:


(a). End states $S_r$ and $S_{r+1}$: transitions start from $S_r$
where all processes have formed the $r$-th recovery
line, and end at $S_{r+1}$ upon establishment of the
$(r+1)$-th recovery line.

(b). Intermediate states $S=(x_1,x_2,\ldots,x_n)$, where
$x_i=0$ if the previous action of $P_i$ was an interaction,
and $x_i=1$ if it was establishment of a recovery point.


Occurrences of interactions and recovery points in a
process make the system go through these states. Note
that both $S_r$ and $S_{r+1}$ are equivalent to state $(1,1,\ldots,1)$. We
can establish the following transition rules:


R1. The system goes to state $(x_1,\ldots,x_{i-1},1,x_{i+1},\ldots,x_n)$
from state $(x_1,\ldots,x_{i-1},0,x_{i+1},\ldots,x_n)$ with rate $\mu_i$ upon
establishment of a recovery point in $P_i$.

R2. The system leaves state
$(x_1,\ldots,x_{i-1},1,x_{i+1},\ldots,x_{j-1},1,x_{j+1},\ldots,x_n)$ and enters state
$(x_1,\ldots,x_{i-1},0,x_{i+1},\ldots,x_{j-1},0,x_{j+1},\ldots,x_n)$ with rate $\lambda_{ij}$ if
there is an interaction between $P_i$ and $P_j$.

R3. The system arrives at state $(x_1,\ldots,x_{i-1},0,x_{i+1},\ldots,x_n)$
from state $(x_1,\ldots,x_{i-1},1,x_{i+1},\ldots,x_n)$ with transition rate
$\sum_{j \in B_i} \lambda_{ij}$, where $B_i=\{j \mid x_j=0,\ j \neq i \text{ and } j \in A\}$.

R4. The system can transfer directly from state $S_r$ to
state $S_{r+1}$ with transition rate $\sum_{k=1}^{n} \mu_k$.


[Figure 2 diagram: state transition diagram for three processes,
with the entry state $S_r$, the terminal state $S_{r+1}$ and the
intermediate states, including state $(0,0,0)$.]

Figure 2. The Model of Asynchronous RB's for 3 Processes


Under these transition rules a Markov model is
developed for three processes $P_1$, $P_2$ and $P_3$, and
presented in Fig. 2. The single-arrow lines are unidirectional
transitions. The double-arrow lines are bidirectional
transitions in which left-hand side parameters represent leftward
transition rates and right-hand side parameters rightward
transition rates.


When $\mu_i=\mu_j=\mu$ and $\lambda_{ij}=\lambda$ for all $i,j \in A$, the model
can be simplified since all intermediate states
$S=(x_1,x_2,\ldots,x_n)$ containing exactly $u$ 1's in
$(x_1,x_2,\ldots,x_n)$ can be replaced by a single state $S_u$. A
simplified model is obtained under the following transition
rules and presented in Fig. 3.


R1'. For $u=0,1,\ldots,n-1$, the system will move to state
$S_{u+1}$ from state $S_u$ with transition rate $(n-u)\mu$ when
a new recovery point is formed.

R2'. For all $u \geq 2$, the system is able to leave state $S_u$
for state $S_{u-2}$ with rate $\lambda u(u-1)/2$.

R3'. For all $u \geq 1$, there is a transition from state $S_u$ to
state $S_{u-1}$ with rate $\lambda u(n-u)$.

R4'. The system can transfer directly from the entry state
$S_r$ to the terminal state $S_{r+1}$ with transition rate $n\mu$.


Figure 3. The Simplified Model of Asynchronous
RB's for n Processes
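As an illustration of how the simplified model can be used, the sketch below computes the mean time from the entry state to the terminal state by solving the usual mean-absorption-time equations of a continuous-time Markov chain. It is a sketch under stated assumptions rather than the authors' procedure: the function name is hypothetical, and the entry state is assumed to follow rule R2' with $u=n$ in addition to R4', since the entry state is equivalent to $(1,1,\ldots,1)$.

```python
def mean_recovery_line_interval(n, mu, lam):
    """Mean time from entry state S_r to terminal state S_{r+1}.

    Unknowns 0..n-1 are the mean times from intermediate states
    S_0..S_{n-1}; unknown n is the mean time from the entry state.
    """
    entry = n
    size = n + 1
    a = [[0.0] * size for _ in range(size)]
    b = [1.0] * size  # (total out-rate) * T_s - sum(rate * T_target) = 1

    def add(row, target, rate):
        a[row][row] += rate
        if target is not None:  # None marks the absorbing terminal state
            a[row][target] -= rate

    for u in range(n):
        add(u, u + 1 if u + 1 < n else None, (n - u) * mu)  # R1'
        if u >= 2:
            add(u, u - 2, lam * u * (u - 1) / 2)            # R2'
        if u >= 1:
            add(u, u - 1, lam * u * (n - u))                # R3'
    add(entry, None, n * mu)                                 # R4'
    if n >= 2:
        add(entry, n - 2, lam * n * (n - 1) / 2)  # assumption: R2' with u = n

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        piv = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(size):
            if r != col and a[r][col] != 0.0:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
                b[r] -= f * b[col]
    return b[entry] / a[entry][entry]
```

With $\lambda=0$ the result reduces to $1/(n\mu)$, because every newly established recovery point immediately forms a new recovery line.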


2.3. The Analysis of Asynchronous Recovery Blocks 


When the occurrences of interprocess communication 
and recovery point are exponentially distributed, X, for all r 
becomes stochastically identical. Let X denote a random 
variable representing the interval between two successive 
recovery lines. The probability distribution of X is derived 
below. 
Let the state space $\Psi=\{0,1,2,\ldots,m\}$, where $m=2^n$, be
the set of states of the foregoing continuous-time Markov
process with the following convention for numbering states:
(a). $S_r$ --> state 0,
(b). an intermediate state $(x_1,x_2,\ldots,x_n)$ --> state
$\left(\sum_{i=1}^{n} x_i 2^{i-1}+1\right)$, and
(c). $S_{r+1}$ --> state $m$.
Then, the Chapman-Kolmogorov equation becomes


$$\frac{d}{dt}\pi(t) = \pi(t)H$$


where $H$ is the $(m \times m)$ transition matrix $[h(u,v)]$ in which
the $(u,v)$ element is the transition rate from state $u$ to
state $v$, and $\pi(t)$ is a vector whose $k$-th element is the
probability that the system is in state $k$ at time $t$. The initial
condition is $\pi(0)=[1,0,0,\ldots,0]$. The interval between two
successive recovery lines, $X$, is equal to the time needed
for transition from state 0 to state $m$. Therefore, the
density function of $X$, namely $f_X(t)$, is equal to $d(\pi_m(t))/dt$.


Suppose process $P_i$ detects an error or fails the
acceptance test at one of its recovery points $RP_i^j$, where
$j=1,2,\ldots,L_i$. The rollback of $P_i$ may propagate to $k$
processes in the process set, $\Omega_A = \{P_i \mid i \in A\}$ where
$A=\{1,2,\ldots,n\}$. Let $D_j^k$ be the rollback distance associated
with the $k$ processes and $RP_i^j$ for $j=1,2,\ldots,L_i$. Then, $X$
represents the supremum of these random variables, i.e.,
$\sup D_j^k$. In Figure 4, the mean values of $X$ are plotted as a
function of $n$. It shows that $X$ increases drastically when
there is an increase in the number of processes involved in
the rollback recovery. The density function of $X$, $f_X(t)$, is
plotted in Figure 5. For all the three cases in Fig. 5, there is
a sharp pulse near $t=0$, which is due to the direct transitions
between $S_r$ and $S_{r+1}$ and to the longer transition time needed
once the system enters intermediate states.


Figure 4. Mean value of X vs. the number of processes

Figure 5. The Density Function of X, $f_X(t)$


3. SYNCHRONIZED RECOVERY BLOCKS


The simplest way of avoiding unbounded rollback
propagations is to synchronize the establishment of recovery
points during process execution. In this method,
interactions are inhibited between any pair of processes during
their establishment of recovery points. There are three
conceivable strategies in deciding when a synchronization
request is to be issued: (1) at a constant interval; (2) when
the time elapsed since the previous recovery line exceeds a
specified value; or (3) when the number of states saved
after the previous recovery line is larger than a prespecified
number. The implementation of the first strategy is simple
since the synchronization request is issued without any
knowledge of the state of execution. Nevertheless, this
strategy may become very inefficient since it is possible to
make synchronization requests immediately after the
formation of a recovery line. For the second and third strategies,
rollback distance and the number of saved states are
prevented from becoming too large. However, in this case
each process must be aware of the occurrence of a
recovery line whenever it is established.


Upon receipt of a synchronization request, every process
has to prepare for establishing a recovery line and also
has to wait for the commitment (for establishing a recovery
line) from other processes before it executes an acceptance
test. Thus, all cooperating processes perform their
acceptance tests at the same instant upon receiving the
commitments from all other processes. Let $P_{ij}$-ready, for
$j=1,2,\ldots,n$, be the flags in process $P_i$ to indicate
commitments from $P_j$. The steps for synchronization in each
process $P_i$ are described as follows:

S1. execute "its own normal process" until "acceptance
    test";

S2. set $P_{ii}$-ready := ON and then broadcast $P_{ii}$-ready;

S3. while not (all $P_{ij}$-ready = ON) do
        receive messages;
        if a message is $P_{jj}$-ready then
            set $P_{ij}$-ready := ON
        else record the message;

S4. reset $P_{ij}$-ready := OFF, for $j=1,2,\ldots,n$, and
    do "acceptance test" and record process states.
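Steps S1 to S4 can be sketched with threads and message queues. This is only an illustrative sketch: the function and parameter names are hypothetical, each process is modeled as a thread with one inbox, `run_until_test` stands for the normal computation up to the acceptance test, and `acceptance_test` stands for the test itself.

```python
import queue
import threading


def establish_recovery_line(n, run_until_test, acceptance_test):
    """Run steps S1-S4 once in n processes; return (saved states, recorded messages)."""
    inboxes = [queue.Queue() for _ in range(n)]
    saved_states = [None] * n
    recorded = [[] for _ in range(n)]

    def process(i):
        state = run_until_test(i)              # S1: normal work up to the test
        ready = [False] * n
        ready[i] = True                        # S2: set own flag, then broadcast
        for j in range(n):
            if j != i:
                inboxes[j].put(("ready", i))
        while not all(ready):                  # S3: wait for all commitments
            kind, payload = inboxes[i].get()
            if kind == "ready":
                ready[payload] = True
            else:
                recorded[i].append((kind, payload))  # keep other messages
        ready = [False] * n                    # S4: reset the flags
        if acceptance_test(i, state):
            saved_states[i] = state            # record the process state

    threads = [threading.Thread(target=process, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return saved_states, recorded
```

Because every process broadcasts its own commitment before it starts waiting, each inbox eventually receives the $n-1$ commitments it needs and no process waits forever.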


Establishment of recovery lines upon synchronization
requests is shown in Figure 6. Synchronization causes the
computation power to be diminished because processes
have to wait for the commitment (as in S3). Let $y_i$ be the
interval between the receiving of a synchronization request
and the moment that process $P_i$ reaches its next
acceptance test (in S1). Then, according to the assumptions in
Section 2.1, $y_i$ is an exponentially distributed random
variable with parameter $\mu_i$. Let $Z=\max\{y_1,y_2,\ldots,y_n\}$. The
total loss in computation power is $CL=\sum_{i=1}^{n}(Z-y_i)$. The mean
loss becomes


$$\overline{CL} = n\int_0^{\infty}\bigl(1-F_Z(t)\bigr)\,dt - \sum_{i=1}^{n}\frac{1}{\mu_i}$$


where $F_Z(t)$ is the distribution function of $Z$ and equals


$$F_Z(t)=\prod_{i=1}^{n}\left(1-e^{-\mu_i t}\right).$$

4. IMPLANTATION OF PSEUDO RECOVERY POINTS 


In the construction of a recovery block, usually, an
acceptance test is a number of executable assessments
provided by the programmer and then followed by a state
saving. Note that process states can also be recorded upon
any other requests if they are considered useful in the
rollback recovery. A pseudo recovery point (PRP) is defined
as a recovery point that is established without a preceding
acceptance test and is proposed here as an alternative for
avoiding the domino effect in a set of cooperating
concurrent processes. With a monitor as the interprocess
communication means, Kim [10] and Kant and Silberschatz [11]
discussed methods for implanting recovery points in a
centralized manner. Similarly, we consider a method for
implanting PRP's in the set of cooperating concurrent processes in
a decentralized manner.


Figure 6. Establishment of Recovery Lines upon
Synchronization Requests


To make every recovery point $RP_i^j$ in $P_i$ maximally
useful for rollback error recovery, there should be corresponding
recovery points in the other processes that have to roll
back as a result of the rollback propagation from $P_i$. If such
recovery points do not actually exist, a pseudo recovery
point, $PRP_{i'}^{ij}$, has to be inserted in process $P_{i'}$ for a given
$RP_i^j$ in process $P_i$. Further, in order to avoid the need of
tracing recovery points at that particular moment, a PRP is
established in each of the other processes involved with
$RP_i^j$. An algorithm for implanting PRP's is given below.


(1). When $P_i$ establishes a recovery point $RP_i^j$, it
broadcasts a PRP implantation request to other processes.

(2). If $P_{i'}$ receives the implantation request, it records its
state as $PRP_{i'}^{ij}$ upon completion of the current
instruction without an acceptance test. Then $P_{i'}$
broadcasts its commitment $C_{i'}$.

(3). Every process executes its own normal task after it
establishes $RP_i^j$ or $PRP_{i'}^{ij}$. However, the messages
sent to other processes by $P_i$ prior to $C_i$ have to be
retained in the state saved.
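The bookkeeping implied by steps (1) and (2) can be sketched as follows. The class and function names are hypothetical, the broadcast is modeled as a direct loop over the other processes, and the commitments $C_{i'}$ and the retention of messages in step (3) are deliberately left out of this sketch.

```python
class Process:
    """A process with its own recovery points and implanted PRP's."""

    def __init__(self, pid, state=0):
        self.pid = pid
        self.state = state
        self.recovery_points = []  # list of (j, saved state): own RP's
        self.pseudo_points = {}    # (i, j) -> state saved for RP j of P_i


def establish_recovery_point(processes, i, acceptance_test):
    """P_i establishes an RP; every other process implants a matching PRP."""
    p = processes[i]
    if not acceptance_test(p.state):
        return False               # no RP, hence no implantation request
    j = len(p.recovery_points) + 1
    p.recovery_points.append((j, p.state))   # step (1): establish the RP
    for q in processes:                      # broadcast the request
        if q.pid != i:
            # Step (2): the state is saved WITHOUT an acceptance test,
            # so it may already be contaminated.
            q.pseudo_points[(i, j)] = q.state
    return True
```

Each recovery point of one process thus costs one additional state saving in each of the other $n-1$ processes.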


Assume that process $P_i$ detects an error before
establishing $RP_i^{j+1}$, and that this error is local to $P_i$. The recovery
line (called a pseudo recovery line, $PRL_i^j$) formed by $RP_i^j$
and all $PRP_{i'}^{ij}$'s is able to recover these processes even if
the error has already propagated to other processes.
However, when the error detected in $P_i$ is due to error
propagation from another process, $P_k$ (and therefore not local to
$P_i$), the contents of $PRP_k^{ij}$ may have already been
contaminated if this error occurred prior to establishing $PRP_k^{ij}$. The
restart from the pseudo recovery line formed by both $RP_i^j$
and all $PRP_{i'}^{ij}$'s may just reproduce the same error.
Therefore, rollback propagation may continue until every process
involved has rolled back to a pseudo recovery line past at
least one of its recovery points. Consequently, the pseudo
recovery line allows the processes to have the shortest
rollback distance for backward error recovery without
synchronization. Note that the pseudo recovery line is now
guaranteed to contain correct states of all concerned
processes. An algorithm of rollback recovery with these
pseudo recovery points is given by:


(1). If an error is found in process $P_i$, set $p := i$ where $p$
is a rollback pointer.


(2). $P_p$ rolls back to its previous recovery point. All
processes $P_{i'}$ affected by the rollback of $P_p$ roll
back to their respective pseudo recovery points.


(3). For every affected process $P_{i'}$, if the rollback has
not passed its most recent recovery point, then set
$p := i'$ and go back to step 2.
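One possible reading of the three steps above is sketched below. The assumptions are not taken from the original text and are simplifications: every process is affected by every rollback, a PRP is implanted at the same instant as the recovery point that requested it, the initial state at time 0 counts as a recovery point, and the function and argument names are hypothetical. Under these assumptions the loop returns the time of the final restart line.

```python
def restart_time(rp_times, failed, fail_time):
    """Time of the pseudo recovery line from which all processes restart.

    rp_times -- dict: process id -> list of times of its own recovery points
    """
    def last_rp_before(pid, t):
        earlier = [s for s in rp_times[pid] if s <= t]
        return max(earlier) if earlier else 0.0  # assumed initial state

    p = failed                                # step (1): rollback pointer
    line = last_rp_before(p, fail_time)       # step (2): P_p rolls back
    while True:
        for q in rp_times:
            own = last_rp_before(q, fail_time)
            # Step (3): q restarted from a PRP that lies after its own most
            # recent recovery point, so the rollback has not passed that point.
            if q != p and own < line:
                p = q
                line = own                    # back to step (2) with p := q
                break
        else:
            return line
```

Under these worst-case assumptions the restart line ends at the earliest of the most recent recovery points of all processes, which agrees with a rollback distance bounded by the largest interval between successive recovery points.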


In Figure 7, the establishment of PRP's in processes
$P_1$, $P_2$, and $P_3$ is illustrated. When $P_3$ fails its acceptance
test, all processes have to restart from the pseudo
recovery line formed by the most recent recovery point of $P_3$
and the corresponding PRP's in $P_1$ and $P_2$, if $P_1$ and $P_2$
are affected by the rollback of $P_3$.


In the above algorithm, we can find that every process
needs to preserve a recovery point for restart in case it
fails. Also $(n-1)$ pseudo recovery points are needed for a
process to form pseudo recovery lines with other processes,
where $n$ is the total number of concurrent processes. The
old RP's and PRP's except those in the pseudo recovery
lines $\{PRL_i^j \mid i=1,\ldots,n,\ RP_i^j \text{ is the most recent RP in } P_i\}$
can be purged when a new recovery point is established,
thereby reducing storage requirements for each process.
Note that rollback distance is bounded by the supremum of
$\{y_1,y_2,\ldots,y_n\}$ where $y_i$ is the interval between two
successive recovery points of process $P_i$. The additional time
overhead for every recovery point is $(n-1)t_r$, where $t_r$ is
the time needed to record the process state. These
overheads should be assessed against the gain of process
autonomy and avoidance of unbounded rollback
propagations.


[Figure 7 diagram: time lines of the processes showing recovery
points (RP), pseudo recovery points (PRP), PRP implantation
requests, and the restart line with respect to the failure of
$P_3$ at its acceptance test.]


Note: all occurrences of interactions are omitted


Figure 7. Establishment of Pseudo Recovery Points
for Rollback Error Recovery


5. CONCLUSION 


We have quantitatively evaluated three different
recovery blocks employed in backward error recovery for
concurrent processing and have estimated the overhead
required to avoid the domino effect when recovery or
pseudo recovery points are employed. For both the
synchronization method and the implantation of pseudo recovery
points, the overheads are largely related to the construction
of synchronizations and PRP's. They would become an
unacceptable burden when synchronizations and pseudo
recovery points are constructed frequently but interprocess
communications occur rarely. At the other extreme, i.e.
asynchronous recovery blocks, the result may be a longer
rollback distance due to unlimited rollback propagations.

To select a suitable strategy or a combination of these
three methods, we have to first examine the properties of
concurrent processes such as the amount of interprocess
communications and the distribution of recovery points.
Then, we weigh the trade-off between the loss of
computation power during normal operation and the increase in
response time due to rollback recovery. In general, if more
knowledge of the execution state in concurrent processes
can be obtained, a better strategy for implementing
recovery blocks can be derived.


IMPROVED MULTIPROCESSOR GARBAGE COLLECTION ALGORITHMS


Newman I.A., Stallard R.P., Woodward M.C. 
Department of Computer Studies 
Loughborough University of Technology 
Loughborough, Leicestershire, U.K. 


Abstract -- This paper outlines the results
of an investigation of existing multiprocessor 
garbage collection algorithms and introduces two 
new algorithms which significantly improve some 
aspects of the performance of their predecessors. 
The two algorithms arise from different starting 
assumptions. One considers the case where the 
algorithm will terminate successfully whatever 
list structure is being processed and assumes 
that the extra data space should be minimised. 
The other seeks a very fast garbage collection 
time for list structures that do not contain loops. 
Results of both theoretical and experimental 
investigations are given to demonstrate the 
efficacy of the algorithms. 


Introduction 


A number of previous papers have considered 
the problems of reclaiming space in a list 
processing system, dynamically, while processing 
(by 'mutators') continues [2,3,4,6,7]. Most 
papers have studied an environment in which one 
process, the garbage collector, works alongside a 
second process, the mutator, with the processes 
normally executing on different processors. 
However, some algorithms allow extensions in which 
several garbage collector processes can be active 
simultaneously. The garbage collector process
generally has three phases: 'set-up', 'marking' and
'collection'. In the first, all nodes are marked
as 'garbage', by setting a mark bit (or bits)
appropriately (referred to as 'colouring' the node
'white'). In the second all nodes that can be
reached from the roots of the list structure are 
marked as accessible (coloured 'black'). This 
marking may have several stages in which the mark 
state (colour) of the node changes which 
necessitates more than one mark bit. Finally, all 
the unmarked (white) nodes are added to the free 
list in the third phase. The set-up phase is 
typically executed as a by-product of the 
collection phase. A possible fourth phase, in 
which all accessible nodes are compacted into the 
minimum physical space, is not generally 
considered. 
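The three phases can be illustrated with a short sequential sketch. It assumes a hypothetical representation of the node space as a dictionary from each node to its list of successors, uses a single mark bit (white or black), and runs without any concurrent mutator, so it shows only the basic idea and not one of the multiprocessor algorithms discussed here.

```python
WHITE, BLACK = 0, 1


def collect_garbage(successors, roots):
    """Return the list of unreachable nodes (the new free list).

    successors -- dict: node -> list of successor nodes
    roots      -- iterable of root nodes of the list structure
    """
    # Set-up phase: colour every node white ('garbage').
    colour = {node: WHITE for node in successors}

    # Marking phase: colour every node reachable from the roots black.
    pending = list(roots)
    while pending:
        node = pending.pop()
        if colour[node] == WHITE:
            colour[node] = BLACK
            pending.extend(successors[node])

    # Collection phase: all nodes still white go to the free list.
    return [node for node in successors if colour[node] == WHITE]
```

Looped structures are handled because a node that is already black is never expanded again.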


Background 


The performance of various garbage collection 
algorithms has been studied as part of an ongoing 
investigation into applications of multiprocessor 
systems in which each processor has its own 
private memory and, in addition, memory and 
possibly other resources are shared. A previous 
paper [4] reported the results of simulation 
studies on two algorithms (referred to as 'Lamport'
and 'Chaining'). A more detailed theoretical and
experimental analysis of these algorithms together
with a third ('Stacking' [7]) has been carried out
[5]. The collection phase is identical for all
three algorithms and can be efficiently
multiprogrammed by partitioning the node space. The
investigation has, therefore, centered on the 
marking phase. Each algorithm was implemented as 
part of a simple list processor executing on a 
system comprising four TI 990/10 computers [1]. 
Results were taken for several types of list 
structures in a node space in which all nodes 
were of the same size and were compared with a 
theoretical analysis of the expected performance 
of each algorithm. 


Analysis 


Both the Lamport and Stacking algorithms will 
guarantee to mark any structure. The former 
requires a two bit field in each node to permit 
marking while the latter requires only one bit in 
the node but also a stack potentially capable of 
taking pointers to all the nodes in the structure 
except for one (the original root). The operation 
of these two algorithms with multiple markers is 
quite different. The Lamport algorithm 
effectively requires each marker to access a 
physically separate portion of the node space. 
Each marker examines nodes in its own node space
until one finds and marks an accessible ('shaded')
node and shades its successors. All markers then
'reset', restarting the search of their node
space. Thus, with most list structures, only one
marker does any useful work between resets. By 
contrast, markers under the Stacking algorithm 
traverse the logical structure by taking a 
pointer to a node from the shared stack, marking 
the node and placing its successors on the stack. 
This normally requires only one visit to each 
node. A highly interconnected list structure 
slightly modifies the behaviour of both algorithms 
in that several markers can stack pointers to the 
same node or can find grey nodes simultaneously. 


Although Stacking is a fast algorithm for a 
single marker it does not work well for multiple 
markers as a marker must have exclusive access to 
the shared stack while adding or removing a 
pointer to a node and this causes substantial 
delays. By contrast, Chaining works efficiently
in this case provided the average number of 
successors of each node is small (either a linear 
list or a 'curtain' in which root nodes have 
several successors but subsequent nodes have only 
one). One controlling parameter on the perform- 
ance of the Chaining algorithm is the size of the 
shared sub-root list. If this is large and each 
marker refills the list whenever it is not full 
then the Chaining and Stacking algorithms are 
closely related. 


Revised Algorithms 


A substantially improved version of the
Lamport algorithm is obtained simply by allowing
each marker to complete its sequential pass
through its section of the node space before
resetting instead of resetting as soon as one marker
has found and coloured a node. This has two
advantages. Firstly, several markers may find
'shaded' nodes on each pass, be able to mark them
and shade their successors. Secondly, the
successors to a node which is marked may
themselves be marked in the same pass through the node
space.
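The effect of completing a pass before resetting can be seen in a single-marker sketch. The code below is an illustration under assumptions that are not in the original: nodes are numbered in scan order, `successors` is a list indexed by node number, and the count of node visits is used as a rough stand-in for marking time. With `finish_pass=False` the marker resets as soon as it has coloured one node; with `finish_pass=True` it completes the pass first.

```python
WHITE, GREY, BLACK = 0, 1, 2


def mark_by_scanning(successors, roots, finish_pass):
    """Scan-based marking; returns (set of marked nodes, number of node visits)."""
    colour = [WHITE] * len(successors)
    for r in roots:
        colour[r] = GREY                 # roots start out shaded
    visits = 0
    found = True
    while found:                         # stop after a pass that finds nothing
        found = False
        for node in range(len(successors)):
            visits += 1
            if colour[node] == GREY:
                colour[node] = BLACK     # mark the shaded node
                for s in successors[node]:
                    if colour[s] == WHITE:
                        colour[s] = GREY # shade its successors
                found = True
                if not finish_pass:      # original rule: reset at once
                    break
    marked = {n for n, c in enumerate(colour) if c == BLACK}
    return marked, visits
```

For a four-node linear list stored in scan order, the resetting variant needs 14 node visits in this sketch while the variant that completes each pass needs 8, because successors are marked in the same pass as their predecessors.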


The introduction of a local stack for each 
marker with a smaller shared stack (analogous to 
the subroot list of the Chaining algorithm) 
enables the Stacking algorithm to utilise multiple 
markers effectively. Each marker refills the 
shared stack if it is not full and if no other 
marker is doing so, otherwise it uses its local 
stack. This minimises the time spent waiting for 
access to the shared stack while ensuring that 
markers always have work available. However, the 
space taken by the stacks can be quite large. 
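The division of work between local stacks and a small shared stack can be sketched as a single-threaded, round-robin simulation. This is not the implementation measured in this paper: the function and parameter names are hypothetical, the roots are assumed to seed the shared stack, and the exclusive-access protocol for the shared stack is omitted because only one marker runs at a time in the simulation.

```python
def mark_with_local_stacks(successors, roots, markers=4, shared_capacity=2):
    """Simulate several markers, each with a local stack plus one shared stack.

    successors -- dict: node -> list of successor nodes
    Returns the set of marked (reachable) nodes.
    """
    marked = set()
    shared = list(roots)                     # assumed: roots seed the shared stack
    local = [[] for _ in range(markers)]
    while shared or any(local):
        for m in range(markers):             # one step per marker, in turn
            if local[m]:
                node = local[m].pop()        # prefer the marker's own work
            elif shared:
                node = shared.pop()          # otherwise take shared work
            else:
                continue                     # this marker is idle this round
            if node in marked:
                continue
            marked.add(node)
            for s in successors[node]:
                if len(shared) < shared_capacity:
                    shared.append(s)         # refill the shared stack if not full
                else:
                    local[m].append(s)       # otherwise keep it locally
    return marked
```

A small `shared_capacity` keeps accesses to the shared stack short, while the local stacks absorb the remaining pointers; the total stack space can still grow with the size of the structure.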


Results


Type of structure      Linear list      [illegible]      Inter-connected
% of occupied nodes    Low              High             High
Number of markers      1       4        1       4        1       4

Algorithm
Lamport                2.50    1.50     54.00   18.40    53.80   12.30
Modified Lamport       0.86    0.41      2.28    0.92     2.57    0.90
Chaining               0.08    0.06      0.98    0.26     6.24    1.63
Stacking               0.09    0.10      1.22    0.88     1.62    2.40
Modified Stacking      0.07    0.05      0.70    0.23     1.60    0.42


All algorithms were run on node spaces in 
which the successors have a random spatial 
distribution and the results obtained are in 
seconds of elapsed time. The single marker 
unmodified Stacking algorithm times are high 
because of the overhead of entering and exiting 
a protected region which is provided for the 
multiple marker case. The speed up of more than 
four for the Lamport algorithm for highly inter- 
connected structures is due to several nodes 
being marked ‘simultaneously’ reducing the number 
of passes required. 


The times given above were with no mutators 
active. Both revised algorithms, however, have 
been shown to be reliable in conjunction with 
mutators, with the effect on the time taken to 
mark depending upon the actions being performed by 
the mutators. 


Further Work 


Two further aspects of multiprocessor
garbage collection are currently being studied.
The first is the efficacy of adding a compaction
phase to the garbage collection process. The
second is a comparison of the marking algorithms
with algorithms in which the number of pointers
to a node is recorded in the node itself
(reference count scheme).


EFFICIENCY OF FEATURE DEPENDENT ALGORITHMS


Abstract--In this paper the concept of feature
(in)dependent image processing algorithms is defined. A 
large class of image processing computers characterized 
by multiple processor-memory subsystems is efficient 
when dealing with feature independent algorithms but 
less efficient when dealing with feature dependent algo- 
rithms. Typically such machines are required to perform 
both types of algorithms. This paper is a preliminary 
attempt to provide a framework within which to model 
feature dependent algorithms, and to, for example, quan- 
tify the inefficiency that can occur when they are exe- 
cuted on the above type of parallel image processors. 


Keywords--feature dependent algorithms, image pro- 
cessing, parallel processing. 


Introduction 


The economics of modern digital integrated circuit 
technology no longer restricts the designers of digital 
systems to the classical serial interpreter typified by 
the von Neumann uniprocessor architecture. This trend 
away from conventional machines is particularly well 
developed in the field of image processing where the 
large data sets (64K bytes to 4M bytes per image) and 
the high processing rates (near term predictions of 1 to 
100 billion operations per second have been made in 
[1]) make special purpose machines an economic neces- 
sity [2]. A number of people have proposed/constructed 
special purpose machines for image processing. These 
are surveyed in [3-5]. 

An architectural characteristic of most of these 
special purpose image processors is a large number of 
processors working in parallel. Parallel processing is a 
natural strategy for dealing with the large data sets and 
high processing rates encountered in image processing 
applications; furthermore, the nature of the data and the 
nature of many of the algorithms make parallel
processing particularly attractive. The data is usually a
large two dimensional array, and many of the low level 
image processing algorithms can be decomposed into a 
large number of concurrent neighborhood operations. 
Examples include: various filtering algorithms such as 
smoothing to reduce high frequency noise and median 
filtering to reduce salt-and-pepper noise; edge detec- 
tion algorithms that use operators such as the Sobel 
operator and the Hueckel operator; and various coding 
algorithms such as block truncation coding and cosine 
transform coding. 


A natural architecture for the above class of image
processing algorithms is a multiprocessor in which equal
subimages are assigned to separate processors for
processing. For the purpose of this discussion we will
classify such processors as multiple subimage processors
(MSP's). As might be expected, a large number of the
proposed/constructed special purpose image processors
can be viewed as MSP's. Figure 1 shows a block
diagram of a generic MSP. Subimage j is handled by its
own processor-memory subsystem, processing element j
(PE_j). The PE's can communicate through some form of
interconnection network (ICN). Specific examples of 
MSP's include: the proposed PASM architecture [6], 
which plans to employ multi-path routing-networks to 
connect a set of 1024 PE's; CLIP4 [7], a 96 x 96 array 
of simple bit-processors, each with a 32 bit RAM and an 
ICN that connects nearest neighbors in the array; the 
Distributed Array Processor [8], a 64 x 64 array of pro- 
cessors with 4K-bit storage per processor and an ICN 
that connects nearest neighbors in the array and pro- 
vides a bus per row and column; the Massively Parallel 
Processor [9], a 128 x 128 array of processors with 
1K-bit storage per processor and an ICN that connects 
nearest neighbors; and the Adaptive Array Processor 
[10], whose building block is a single chip 8 x 8 array 
with 96 bits of storage per processor. 


Figure 1. Generic MSP.


In general, MSP's are highly efficient at performing
neighborhood operations such as those listed above.
These types of operations are an important subclass of
what we will term feature independent image processing
algorithms. Feature independent algorithms are
characterized by equal processing per pixel. In other words,
each pixel receives the same amount of processing
regardless of whether or not it is part of a feature of
interest such as a line segment. As well as many
neighborhood operations there are other algorithms such as
histogramming and the Fourier transform which are 
feature independent. Unlike neighborhood operations 
these algorithms require significant amounts of data to 
be moved between processors. The effectiveness of 
MSP's at performing them is dependent on the bandwidth 
of the ICN shown in Figure 1. A multiprocessor like PASM 
with a high bandwidth ICN can perform such algorithms 
relatively easily [11-13]. Therefore, the concept of a 
multiprocessor in which equal subimages are assigned to 
separate processors for processing is also a natural way 
of handling the complete range of feature independent 
algorithms, provided the ICN is appropriate for the types 
of feature independent algorithms anticipated. 


Although the above concept is natural for feature
independent algorithms, it becomes less attractive for
feature dependent image processing algorithms. Feature
dependent algorithms are characterized by unequal 
amounts of processing per pixel. This might arise when a 
pixel is part of a feature of interest and because of that 
requires separate treatment. A simple example of a 
feature dependent algorithm is contour tracing; only 
edge pixels are involved in the algorithm. In an image 
processing application the initial sequence of algorithms 
involves mostly feature independent algorithms because 
they are concerned with general image enhancement and 
potential feature location. The subsequent sequence of 
algorithms is much more likely to involve feature depen- 
dent algorithms because specific features are sought 
from the set of potential locations. 


Consider processing an N-pixel image on an MSP
machine having m PE's. In normal MSP operation the
image is divided into m subimages of equal size, N/m
pixels each, and each subimage is processed by a single
PE. However, in the case of feature dependent algorithms
the image should be divided into subimages of equal
interest, i.e., subimages having equal numbers of pixels
of interest. If, in the case of feature dependent
algorithms, images are divided into subimages of equal
size, some PE's will receive fewer pixels of interest
than others. This uneven distribution of work will result
in some PE's being idle during part of the algorithm.
Dividing the image into subimages of equal interest
requires that the distribution of pixels of interest over
the image can be calculated. This is not always possible.
On the other hand, it may be possible, but the calculation
and the redistribution on the basis of interest may
involve more computation than that lost through the
inefficiency of having some PE's idle during part of the
algorithm.


This paper is a preliminary attempt to provide a 
framework within which to model feature dependent 
algorithms, and to, for example, quantify the above inef- 
ficiency to assist in decisions about image distribution 
among PE's. 


The following section develops a mathematical 
model of feature dependent algorithms. Section 3 tests 
it using some real image data with edge pixels as the 
pixels of interest. Section 4 concludes the discussion. 


2. Mathematical Model of Feature Dependent Algo- 
rithms 


Consider an N-pixel image and an m-PE MSP sys- 
tem. Assume that the pixels of interest occur randomly 
in the image and that the probability of a pixel being of 
interest is p regardless of its position. Assume that the 
MSP system is executing an image processing algorithm 
on the image. Let the time to complete the algorithm be 
a function, f, of the number of pixels of interest in the 
image, i.e., the algorithm is a feature dependent one.


For the single PE case (m=1) the expected value
of the execution time, $T_1$, is given by:

$$T_1 = f(Np) \tag{1}$$

For the m-PE case assume that the image is divided 
among the m PE's on an equal size basis. Each PE holds 
an n=N/m pixel subimage. Let $X_i$ be the random
variable describing the number of pixels of interest in
subimage i, i=1,2,...,m. From the above assumption that
the probability of a pixel being of interest is p regardless
of its position, it follows that the $X_i$'s are independent
and identically distributed (i.i.d.) random variables with
a binomial distribution (see Figure 2).


Let $T_m$ be the expected value of the maximum
execution time among all PE's. Since the algorithm is not
finished until all the m PE's have completed the work in
their subimage, it follows that:

$$T_m = f(E[X_{max}]) \tag{2}$$

Where:

$$X_{max} = \max(X_1, X_2, \ldots, X_m) \tag{3}$$

To evaluate $T_m$, consider the following. Let $p_j$ be the
probability of exactly j pixels of interest occurring in
subimage i:

$$\Pr\{X_i = j\} = p_j = \binom{n}{j} p^j (1-p)^{n-j} \tag{4}$$


Let $q_j$ be the probability of greater than j pixels of
interest occurring in subimage i:

$$\Pr\{X_i > j\} = q_j = \sum_{r=j+1}^{n} p_r \tag{5}$$

Then:

$$q_j = \sum_{r=j+1}^{n} \binom{n}{r} p^r (1-p)^{n-r} \tag{6}$$


Figure 2. A subimage and its associated random variable:
an image of N pixels is partitioned into subimages of n
pixels; the i-th subimage of n pixels has $X_i$ pixels of
interest.


Let P(z) be the generating function for the sequence $p_j$,
j=0,1,...,n:

$$P(z) = p_0 + p_1 z + \cdots + p_n z^n \tag{7}$$

Let Q(z) be the generating function for the sequence $q_j$,
j=0,1,...,n:

$$Q(z) = q_0 + q_1 z + \cdots + q_n z^n \tag{8}$$

From (7) and (8) it follows that:

$$Q(z) = \frac{1 - P(z)}{1 - z} \tag{9}$$


Equation (9) can be verified by equating the coefficients
of z on both sides of the equation:

$$1 - P(z) = (1 - z)Q(z) \qquad (q_n = 0) \tag{10}$$

See [14].

Differentiating P(z) with respect to z yields:

$$P'(z) = p_1 + 2p_2 z + \cdots + n p_n z^{n-1} \tag{11}$$

Evaluating P'(z) at z=1 yields:

$$P'(1) = p_1 + 2p_2 + \cdots + n p_n \tag{12}$$

The right hand side of the above equation is simply
$E[X_i]$. Thus:

$$E[X_i] = P'(1) \tag{13}$$


Differentiating both sides of (10) yields:

$$-P'(z) = -Q(z) + (1 - z)Q'(z) \tag{14}$$

Evaluating (14) at z=1 yields:

$$P'(1) = Q(1) \tag{15}$$

Comparing to (13) gives:

$$E[X_i] = Q(1) \tag{16}$$

Next consider $\Pr\{X_{max} \le j\}$:

$$\Pr\{X_{max} \le j\} = \Pr\{X_1 \le j \text{ and } X_2 \le j \ldots \text{ and } X_m \le j\} \tag{17}$$

Since the $X_i$'s are i.i.d., (17) reduces to:

$$\Pr\{X_{max} \le j\} = [\Pr\{X_i \le j\}]^m \tag{18}$$

for any i. Using the relation:

$$\Pr\{X_{max} > j\} = 1 - \Pr\{X_{max} \le j\} \tag{19}$$

gives:

$$\Pr\{X_{max} > j\} = 1 - [\Pr\{X_i \le j\}]^m \tag{20}$$

But from (16), applied to $X_{max}$:

$$E[X_{max}] = Q_{max}(1) \tag{21}$$

And, by definition:

$$Q_{max}(1) = \Pr\{X_{max} > 0\} + \Pr\{X_{max} > 1\} + \cdots + \Pr\{X_{max} > n\} \tag{22}$$

Therefore, substituting (20) into (22) and using (2) gives:

$$T_m = f\left( \sum_{k=0}^{n} \left( 1 - [\Pr\{X_i \le k\}]^m \right) \right) \tag{23}$$

Where $\Pr\{X_i \le k\}$ is given by:

$$\Pr\{X_i \le k\} = \sum_{r=0}^{k} \binom{n}{r} p^r (1-p)^{n-r} \tag{24}$$


Notice that the value of $T_m$ is independent of i because
the $X_i$'s are i.i.d.


Following the usual arguments (see [15]) the efficiency
E can be defined in terms of $T_1$ and $T_m$ by:

$$E = \frac{T_1}{m T_m} \tag{25}$$


Thus the efficiency of executing feature dependent 
algorithms can be determined from (1), (23), (24) and f, 
the function that describes the time to complete the 
algorithm. 
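
The model of this section can be evaluated numerically. The following is a minimal Python sketch, not the authors' program: it computes the binomial terms of (24) directly, sums the tail probabilities of (23), and forms the efficiency of (25). The function names, the default linear f, and the assumption that m divides N are illustrative choices. The direct use of math.comb is suitable only for modest n.

```python
from math import comb


def binomial_cdf(n, p):
    """Return [Pr{X_i <= k} for k = 0..n] for X_i ~ Binomial(n, p), as in (24)."""
    cdf, running = [], 0.0
    for r in range(n + 1):
        running += comb(n, r) * p**r * (1.0 - p) ** (n - r)
        cdf.append(min(running, 1.0))  # guard against rounding above 1
    return cdf


def expected_max(n, m, p):
    """E[X_max] = sum over k of (1 - Pr{X_i <= k}^m), the argument of f in (23)."""
    return sum(1.0 - c**m for c in binomial_cdf(n, p))


def efficiency(N, m, p, f=lambda x: x):
    """E = T_1 / (m * T_m) from (25), with T_1 from (1) and T_m from (23)."""
    n = N // m  # pixels per subimage; assumes m divides N
    t_1 = f(N * p)
    t_m = f(expected_max(n, m, p))
    return t_1 / (m * t_m)


if __name__ == "__main__":
    # Efficiency for a 64 x 64 image as the number of PE's grows.
    for m in (1, 16, 256):
        print(m, round(efficiency(4096, m, 0.4), 3))
```

With m=1 the sum in (23) reduces to Np, so the sketch returns an efficiency of 1, which is a convenient sanity check on (16).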


3. Experimental Results 


In an attempt to test the above results the following
experiment was carried out on a set of images of
industrial parts. These images were obtained from the 
General Motors database for the industrial bin of parts 
problem [17]. The names of the ones used are listed in 
Table 1. 


The Sobel edge operator was applied to the above 
images. A pixel was defined to be of interest if and only 
if it was on an edge. The resulting image was
thresholded and the number of edge points (number of pixels
of interest) was computed. The threshold value was 
chosen to give a "good" edge image. All the images are 
256x256 with 256 gray levels. The number of pixels of 
interest in each image and the value of p are also shown 
in Table 1. The value of p was estimated as the number 
of pixels of interest divided by the total number of pix- 
els in the image. 
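
The procedure just described can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the code used in the experiment: it uses the standard 3x3 Sobel kernels with the Euclidean gradient magnitude, leaves border pixels unmarked, and takes the threshold as a caller-supplied parameter (the paper does not give the value used). The function names are hypothetical.

```python
def sobel_edge_mask(image, threshold):
    """Return a 0/1 mask of pixels whose Sobel gradient magnitude exceeds threshold.

    image is a list of equal-length rows of gray levels; border pixels stay 0.
    """
    rows, cols = len(image), len(image[0])
    mask = [[0] * cols for _ in range(rows)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            # Horizontal gradient: right column minus left column, weights 1-2-1.
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            # Vertical gradient: bottom row minus top row, weights 1-2-1.
            gy = (image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                mask[y][x] = 1
    return mask


def estimate_p(mask):
    """Estimate p as (number of pixels of interest) / (total number of pixels)."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total
```

For a 256x256 image, estimate_p divides the edge-pixel count by 65536, which is the estimate of p described above.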


The images were divided into subimages of equal
sizes and the expected value of the maximum number of
pixels of interest was obtained experimentally. The
experimental value obtained was compared with its
theoretical value obtained from equation (23), with f
taken to be the identity function, for various values
of m. The results are shown in Graph 2. It can be
seen that there is fairly good agreement between the
theoretical results and the experimental results when
the features are edge pixels. The lower of the two
curves is the theoretical one. The discrepancy is due to
our assumption that the probability of a pixel being of
interest is not related to its position. In the case of
edge pixels this is clearly not so, as they cluster in lines.
Clustering moves the experimental curve higher.
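
The experimental quantity can be computed from an edge mask as in the following sketch. It is an illustration only: the paper does not state the shape of the subimages, so rectangular blocks of a caller-chosen size are assumed here, and the function name is hypothetical.

```python
def max_pixels_of_interest(mask, block_rows, block_cols):
    """Return the largest count of pixels of interest over all equal-size subimages."""
    rows, cols = len(mask), len(mask[0])
    if rows % block_rows or cols % block_cols:
        raise ValueError("subimage size must divide the image size")
    best = 0
    for top in range(0, rows, block_rows):
        for left in range(0, cols, block_cols):
            count = sum(
                mask[y][x]
                for y in range(top, top + block_rows)
                for x in range(left, left + block_cols)
            )
            best = max(best, count)
    return best
```

Two masks with the same number of pixels of interest can give different maxima: four pixels spread one per 2x2 block give a smaller maximum than the same four pixels lying on one line, which is the clustering effect noted above.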


In the case of specific features better results
might be obtained if a more accurate stochastic model of
the feature distribution can be developed. For example,
more accurate models of edge pixel distributions have
been developed [18]; however, they apply only to edges,
and computing $T_m$ for them appears to be a problem.


Table 1. Images used in the experiment.

Image Name
bin1.piv
bin1.piw
bin3.piv
bind.piz
bin8.piv
yoke.pit
yoke2.pit
yoke3.pit
rod1.pit
bin1.piw


Graph 1 shows the variation of the efficiency, E, as
a function of the ratio N/m for p = 0.2, 0.4, 0.6, 0.8.
The graph was plotted by assuming f to be linear. A more
realistic function would depend on the specific feature
dependent algorithm being considered. However, linear
does appear to be a reasonable assumption for a large
class of algorithms. For example, a relatively
complicated feature dependent algorithm such as the
Generalized Hough transform [16] is approximately linear: for
each pixel of interest no more than a fixed number of
accumulators have to be updated.


If care is taken $T_m$ can be evaluated in O(n) time.
The term from (24) should not be evaluated from scratch
for each value of k. Also, for large values of n the terms
on the right hand side of (24) can be approximated by a
Poisson distribution whose terms can in turn be
evaluated using Stirling's formula and logarithms.
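
One way to obtain the O(n) evaluation is sketched below. This is an assumption-laden illustration rather than the authors' method: the cumulative sum of (24) is carried forward from k to k+1 instead of being recomputed, and each binomial term is evaluated in logarithms with math.lgamma, which here plays the role that Stirling's formula plays in the text. The exact binomial terms are used, not the Poisson approximation. The function name is hypothetical and 0 < p < 1 is required.

```python
from math import exp, lgamma, log


def expected_max_linear_time(n, m, p):
    """Evaluate sum_{k=0}^{n} (1 - Pr{X_i <= k}^m) in a single pass over k."""
    log_p, log_q = log(p), log(1.0 - p)
    log_n_factorial = lgamma(n + 1)
    cdf = 0.0    # running value of Pr{X_i <= k}
    total = 0.0  # running value of the sum in (23)
    for k in range(n + 1):
        # log of C(n, k) p^k (1-p)^(n-k), computed without forming large factorials
        log_term = (log_n_factorial - lgamma(k + 1) - lgamma(n - k + 1)
                    + k * log_p + (n - k) * log_q)
        cdf = min(cdf + exp(log_term), 1.0)
        total += 1.0 - cdf**m
    return total
```

Working in logarithms avoids the overflow and underflow that the direct product of a binomial coefficient and small powers of p would cause for large n.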


Several points can be deduced from Graph 1. The
efficiency tends to p as N/m goes to 1. This agrees
with intuition: if there were as many PE's as pixels, p
would be the fraction likely to contain an interesting
pixel, and only this fraction would have any work. For
very low values of p (<<0.2) the efficiency can drop
drastically for MSP's processing images that have less
than an order of magnitude more pixels than they have
PE's. For example, PASM with 1024 PE's will operate at
less than 40% efficiency on images of 64 x 64 pixels if
p=0.4. On the other hand if the images are 256 x 256
the efficiency jumps to over 80% for the same value of
p. Clearly, for high efficiency the image should contain
several orders of magnitude more pixels than the MSP
has PE's.


Graph 1. E versus N/m (vertical axis: efficiency E;
horizontal axis: ratio N/m).


Graph 2. Horizontal axis: number of PE's (log m).


4. Conclusions


This paper has presented a preliminary attempt to 
provide a framework within which to model feature 
dependent algorithms, and to, for example, quantify the 
inefficiency that can occur in MSP's when subimages of 
equal size are distributed among the PE's. 


The mathematical model was simple enough to allow 
key terms such as $T_m$ to be efficiently computed
without compromising the accuracy of the result. Future
work might examine how F can be determined if more
complex, say Markov, models were used for the features
of an image.


Abstract


An attributed directed graph model which is a com- 
bination of high-level Petri nets and AND/OR graphs is 
described. This model provides a method for matching
a parallel algorithm to an architecture or vice versa. The
analysis of parallel computation using this model is 
described. Examples are given to demonstrate the 
descriptive power of this model and how it helps us to 
match an algorithm and an architecture. 


1. Introduction 


It is well known that parallelism is an efficient
method to increase the speed of computation in both
hardware and software systems. Despite the achievements
in designing parallel computers and parallel algorithms,
very little attention has been paid to studying the
relationship between them. As a result, some parallel
algorithms may be more effective when executed on
one parallel computer than on another. Wirsching and
Kishi [1] reported on five projects investigating the
efficiency of highly parallel and highly concurrent
computing systems for different problems. They concluded
that the efficiency of solving one particular problem
varies across parallel computers. After extensive
testing experiments, Deminet [2] pointed out that when
the structure of an algorithm corresponds well to the
structure of the computer, a close-to-linear speedup may
be achieved. Hon and Reddy [3] formulated several
valuable principles which outline the type of algorithm
that is more efficient on an architecture with certain
specified features. The question that now arises is: on
what information should we base the choice of the most
efficient computer system for a particular parallel
algorithm, or vice versa?

Kung [4] classified parallel algorithms in terms of 
three dimensions: concurrency control, module granular- 
ity, and communication geometry. Jones and Schwarz 
[5] also pointed out three spaces for parallel computa- 
tion; they are, the computation unit (granularity), com- 
munication patterns, and patterns of reference to data. 
Both papers [4, 5] informally classified parallel 
algorithms/architectures in terms of their characteristics. 
Cantoni and Levialdi [6] tried to match tasks in image 
processing to a parallel architecture. They defined 
“match” as the degree of exploitation of the system 
resources (including time) to obtain a specific solution. 
However, the term “degree” is not defined. Instead, they 
selected several coefficients for system resources and 
problem requirements. From these coefficients, Cantoni 
and Levialdi derived the equation for execution time, 
provided the task and architecture are specified. They 
claimed that execution time is useful in determining a 
matching value. Their work is a good start of formal 
analysis in matching an algorithm to an architecture. 
Nevertheless, we believe that system resources should 
not be the only parameter for measuring the degree of 
matching. Control flow, system layout, data movement,
etc. are also responsible for the overall system
performance. In this paper, we intend to study parallel
computation and the relationship between parallel algorithm
and architecture. We first propose a graph model which 
is suitable for modeling both algorithms and architec- 
tures. Then the analysis of parallel computation is 
presented based on this model. Finally, examples are 
given to show the descriptive power of this model and 
how it helps to make decisions on the type of architec- 
ture we should use for a particular algorithm. 


2. Graph Model 


A number of models for parallel computation have 
been proposed [7-12, 19]. Among them the directed 
graph model has shown its capability in modeling both 
parallel algorithms and parallel architectures. It is 
advantageous to use the same model for both hardware 
and software since it simplifies the study of relationships 
between algorithm and architecture. As Peterson
claimed in his paper [7], Petri nets, a special type of
directed graph, are ideal for modeling systems of
distributed control with multiple processes occurring
concurrently. Another major feature of Petri nets is their
asynchronous nature and the nondeterminism in Petri
net execution. Figure 1 [7] shows a Petri net model for a
producer-consumer problem with one producer (places $P_1$
and $P_2$) and two consumers (places $P_4$, $P_5$ and $P_6$, $P_7$).
The items produced by the producer are passed to the
consumers. This is modeled by place $P_3$, and the tokens
“produced” by transition $t_2$ and “consumed” by
transitions $t_3$ and $t_5$. Tokens are moved by the firing of the
transitions in the net. A transition must be enabled (all
of its input places have a token in them) in order to fire.
A transition fires by removing one token from each of its
input places and generating one token in each of its output
places. In Figure 1, for example, the transition $t_2$ is
enabled while $t_3$ and $t_5$ are not. If $t_2$ fires, the marked
Petri net of Figure 2 results. In Figure 2, three
transitions are enabled: $t_1$, $t_3$, and $t_5$. Note that the place $P_3$
is an input place for both transitions $t_3$ and $t_5$;
therefore the firing of either $t_3$ or $t_5$ disables the other.
Consequently, these two transitions are said to be in
conflict. This conflicting situation leads to
nondeterminism in Petri net execution. That is, the choice as to
which transition fires is made in a nondeterministic
manner that is itself not modeled. The nondeterminism
is a good feature from a model’s point of view, but it
should not be used when modeling a deterministic
algorithm. In other words, we need a deterministic graph
model.
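
The firing rule and the conflict it produces can be sketched in a few lines of Python. The net below is a hypothetical fragment (one shared buffer place feeding two consumer transitions), not a transcription of Figure 1 or Figure 2; all place and transition names are illustrative.

```python
def enabled(marking, transition):
    """A transition is enabled when every input place holds at least one token."""
    inputs, _ = transition
    return all(marking[place] >= 1 for place in inputs)


def fire(marking, transition):
    """Fire an enabled transition and return the resulting marking."""
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled")
    inputs, outputs = transition
    new_marking = dict(marking)
    for place in inputs:
        new_marking[place] -= 1  # remove one token from each input place
    for place in outputs:
        new_marking[place] += 1  # add one token to each output place
    return new_marking


# Hypothetical fragment: each transition is (input places, output places).
transitions = {
    "consume_a": (("buffer", "a_ready"), ("a_busy",)),
    "consume_b": (("buffer", "b_ready"), ("b_busy",)),
}
marking = {"buffer": 1, "a_ready": 1, "b_ready": 1, "a_busy": 0, "b_busy": 0}
```

Both transitions are enabled in the initial marking, but firing either one removes the single buffer token and disables the other. Nothing in the net itself says which of the two fires, which is the nondeterminism discussed above.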

In [12, 13] the authors introduce control nodes into
the graph model in order to achieve determinism. For
example, Figure 3 shows the Petri net of Figure 2 with
two extra control nodes $c_1$, $c_2$. If $c_1$ and $c_2$ are mutually
exclusive, i.e., one of them has a token but not both of
them, they can resolve the conflict between $t_3$ and $t_5$. This
arrangement has two disadvantages. Firstly, the control
nodes increase the complexity of the graph model.
Secondly, the implementation of control nodes constitutes
another step in the design process. We believe that
the major cause of nondeterminism in the Petri net is
that none of its transitions or places can perform a
“disjoint” operation. In fact, the disjoint concept is totally
ignored in the Petri net model. There is another type of
graph which can explicitly distinguish “joint” and
“disjoint” operations. This type of graph is called an AND/OR
graph [14]. Figure 4 illustrates the ability of AND/OR
graph notation to resolve the conflict in Figure 2. Note
that in this figure, the token in $P_3$ can go to either $t_3$ or
$t_5$ but not both. Nevertheless, in order to choose the
appropriate destination a decision policy has to be
imposed on $P_3$.

The Petri nets which carry extra information on
their places and transitions have been discussed in [11,
15, 16]. According to Genrich and Lautenbach [11] this
type of high-level Petri net adds a new dimension to the
modeling power and complexity of Petri nets, namely
the formal treatment of individuals and their changing
properties and relations. Unfortunately, high-level
Petri nets do not cover the joint and disjoint conditions,
but they do allow expressions to be associated with places
and transitions. In the following section, we present a
new model which takes advantage of both high-level 
Petri nets and AND/OR graphs. 


3. Attributed Directed Graph (ADG) 


In the proposed attributed directed graph model
there are two types of node, namely, the operation node
(O-node) and the data node (d-node). These two types of node
are equivalent to the transition and place in a Petri net,
respectively. The extra information associated with the
nodes is expressed in terms of attributes. This explains
the nomenclature. The two basic nodes are defined as
follows:

Definition 1 — An operation node (O-node) is defined
as the expression of a subtask. The subtask, depending
on the given problem, may be as simple as an ADD
operation or as complicated as calculating the distance
between two strings. Each O-node has the attributes
(OPR, OP, WM). OPR is the number of operands, OP
is the operation or subtask assigned to this O-node, and
WM is the working memory space required by the
operation.

The attributes of an O-node explicitly reveal the
characteristics of the O-node and its relationships with
others. To be more specific, OPR not only shows the
number of operands required by the O-node but also
indicates that there must be connections between this O-node
and other nodes in order to obtain operands. OP
represents the computational complexity of the O-node and
implies whether the operands are required by the
operation simultaneously or in a sequential fashion. WM
reveals the memory space complexity.


Definition 2 — A data node (d-node) is defined as the
place which holds the conditions of an O-node or stores
the consequences of an O-node. In other words, it is
the place to which an operation stores data and from
which it fetches data. The attributes associated with a
d-node are represented as (ID, ORD). ID represents the
number of various data that reside in this node, and ORD
specifies the order in which this d-node is referenced,
which may be in either parallel or sequential fashion.

The connection between two nodes is called an edge.
As in SF-nets, an edge can only connect nodes of
different types [9]. An edge also possesses attributes. Its
attributes (V, MD) have the following meaning. V is the
number of variables transmitted via this edge, and MD is
the mode of transmission, which may be either sequential
or parallel. According to the AND/OR graph notation
the joint and disjoint situations are reflected through
edges as defined below.


Definition 3 — An edge is called an AND case edge
when there exists an arc connecting this edge with other
edges. An OR case edge does not have any connecting
arc. AND case edges require that the information
transmitted through them be sent in parallel. An OR
case edge, on the other hand, requires mutually exclusive
transmission. The exact order of transmission through
an OR case edge follows the direction of the connected
O-node or d-node.


Definition 4 — An attributed directed graph (ADG)
is a four-tuple $(D, O, A, M_0)$

where (1) D is a finite set of d-nodes,

      (2) O is a finite set of O-nodes,

      (3) $A \subseteq (D \times O) \cup (O \times D)$ is a finite set of
          AND/OR case edges,

and   (4) $M_0$ is the initial marking of the ADG model.
          This marking is expressed by tokens (black
          dots). The movement of a token is governed
          by the firing rules which are defined below.


Definition 5 — The firing rules [9]:

(1) An O-node is enabled if all of its input d-nodes hold
    at least one token and its output d-nodes are
    empty.

(2) Any enabled O-node remains enabled and may be
    fired at any time according to the operations of the
    O-node.

(3) An O-node is fired by removing one token from each
    of its input d-nodes and adding one token to
    each of its output d-nodes. At this point, the
    O-node execution is complete.
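
Definitions 1, 2 and 5 can be read as a small data structure. The sketch below is one possible Python rendering under stated assumptions, not an implementation given in the paper: the class and function names are hypothetical, attribute values are stored as plain integers and strings, and the edge attributes (V, MD) and the AND/OR case of Definition 3 are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class DNode:
    name: str
    ID: int          # number of various data residing in the node
    ORD: str         # reference order: "parallel" or "sequential"
    tokens: int = 0  # current marking of this d-node


@dataclass
class ONode:
    name: str
    OPR: int         # number of operands
    OP: str          # operation or subtask assigned to the node
    WM: int          # working memory space required by the operation
    inputs: list = field(default_factory=list)   # input d-nodes
    outputs: list = field(default_factory=list)  # output d-nodes


def is_enabled(o):
    """Rule (1): every input d-node holds a token and every output d-node is empty."""
    return (all(d.tokens >= 1 for d in o.inputs)
            and all(d.tokens == 0 for d in o.outputs))


def fire(o):
    """Rule (3): remove one token from each input d-node, add one to each output."""
    if not is_enabled(o):
        raise RuntimeError(f"{o.name} is not enabled")
    for d in o.inputs:
        d.tokens -= 1
    for d in o.outputs:
        d.tokens += 1


if __name__ == "__main__":
    a = DNode("a", 1, "parallel", tokens=1)
    b = DNode("b", 1, "parallel", tokens=1)
    s = DNode("s", 1, "parallel")
    add = ONode("add", 2, "ADD", 1, inputs=[a, b], outputs=[s])
    fire(add)
    print(a.tokens, b.tokens, s.tokens)
```

Unlike the plain Petri net rule, rule (1) also requires the output d-nodes to be empty, so an O-node cannot fire again until its result has been consumed.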


4. Attributes in Parallel Computation 


As mentioned earlier, many important features 
should be considered in a parallel computing environ- 
ment, such as control interconnection, routing, memory 
conflict, scheduling, synchronization, etc. In order to 
understand the nature of parallel computing, it would be 
better to start from its special features. In this section 
we choose five features as the attributes of a parallel 
computation. They are described in the following. 


4.1 Data Movement 


Every computing system could be considered as a 
data manipulator. That is, it receives data and after 
certain operations produces resulting data. Every algo- 
rithm is a straightforward procedure which controls the 
computing system and hence the data movement within 
it. In more detail, there are many subtasks in an algo- 
rithm and each of them requires input data and produces 
output data. For instance, an ADD operation requires 
two operands and produces the sum; a string distance 
calculation takes in two input strings and gives the dis- 
tance between them as the result. Usually, the data 
movement required by the subtasks is embedded in the 
algorithm. For instance, the previous two examples may 
appear in an algorithm like S = a+b and P = distance 
(x,y), where S and P are sum and distance, respectively. 
The algorithm tells what data we should use, but it
never shows us where they come from. At the
implementation stage different architectures obtain data
in different ways. For example, data may be
broadcast through a central control unit or exchanged
in the shared memory; it may also be delivered by data
busses or by some direct data line connections. When
executing an algorithm, the data movement is the other
important part besides the direct calculations. We will
describe data movement as regular or irregular, which
refers to the connections between subtasks; and variant
or invariant, which indicates whether or not the
connections change with time.


4.2 Module Granularity 


Every subtask (or task module) decomposed from a 
given task requires some execution time. We call this 
elapsed time the module granularity (MG). In an algo- 
rithm, the MG is not as important as the overall com- 
plexity analysis, but it does affect the analysis indirectly. 
In a real execution, MG is an important factor for the 
problem of synchronization, memory contention, etc. It 
sometimes forces a designer to choose a different
architecture or to use a different control scheme. For
instance, suppose that subtasks X and Y must both be
completed before we go on to subtask Z, and that X and Y
require different execution times. A synchronizer is then
needed before subtask Z starts. On the other hand, if X
and Y execute the same operation and they happen to
fetch the same data at the same time, then the memory
contention problem arises.

In the above situations, we have to take the module 
granularity into consideration before we decide the par- 
ticular system to use or the type of hardware/software 
conflict resolver to implement. Here we classify module 
granularity as uniform or nonuniform, which indicates 
whether or not all the processes have the same computa- 
tion. We also call a MG either “large,” “small,” or “con- 
stant.” This is not a well-defined term; rather, it tells us 
the relative size of the module granularity when com- 
pared with each other. 


4.3 Communication Geometry


When the task modules of a parallel algorithm are 
connected to represent their inter-module communica- 
tion, the geometric layout of the resulting network is 
referred to as the communication geometry. This 
geometric layout does not have to be identical to the 
data movement. The data movement identifies the 
source and destination in a data transaction. The same 
data transaction can be achieved on different inter- 
module connections as long as we provide proper routing 
paths. With routing ability, a rather simple
communication geometry can be used, which in turn reduces
the complexity of the implementing hardware.
Communication geometry is closely related to control
overhead. That is, for a specific data communication, a
direct connection network may require little or no control
while a simple indirect connection network may need a
routing algorithm to manage the data movement. Since
we are only interested in the relation between algorithms
and architectures, routing is not discussed here. We
classify communication geometry according to its
geometric layout as irregular or regular. Regular
geometries are further divided into interconnection switches
(crossbar, perfect shuffle, etc.) and array networks (1-D,
2-D, etc.).


4.4 Memory Space 


In algorithm analysis [17], both time and memory 
space are important factors in judging the efficiency of 
an algorithm. In terms of hardware, memory access is 
always slow and memory space occupies a large portion 
of area even with today’s IC technology. Although the
speed and dimension of a memory element have been
improved drastically, it is still the most time-consuming
and expensive part of any system. The common
trade-off with memory space is that if one puts memory
on the same chip as the arithmetic logic unit, one gains
fast memory access at the cost of small memory space.
On the other hand, by supplying secondary memory, one
can increase memory space but suffers longer access time.
Besides, in a system with a large memory database, the
management policy is another important factor in deciding
the efficiency of a system. For the purpose of this paper,
we classify
memory space as local or global, which refers to its local- 
ity; and constant or nonconstant, which indicates 
whether or not the required memory space depends on 
the input data. We sometimes call the size of memory 
space small or large. This is not quantitatively defined; 
instead, it is a comparative term. 


4.5 Concurrency Control 


Control information is usually only implicitly shown 
in an algorithm, but it is essential to a hardware system. 
The control directs the computation sequence and 
assures the correctness. There are various ways to
control a parallel computing system: a stored segment of
microcode, a system clock, special control logic,
etc. Control is closely related to other attributes. In
this paper, we only classify concurrency control as cen- 
tralized or distributed. Detailed control techniques are 
not considered at this point. 

The attributes mentioned above are by no means a 
complete description of parallel computing. The simple 
classifications of each attribute are only for the conveni- 
ence of this study, that is, to reveal the relationship 
between an algorithm and an architecture. These five 
attributes can easily be extracted from the proposed 
ADG model. In other words, the ADG model is capable 
of describing parallel computations. Figure 5 outlines 
the relation between ADG model and the attributes of 
parallel computation. For example, the structure of an
edge defines the communication geometry; memory space
is determined by the WM in an O-node and the ID in a
d-node; concurrency control is covered in the ORD of a
d-node and the OP of an O-node; module granularity is
decided based on the OP of an O-node; and data movement
is directed by the OPR in an O-node and the edge.
These relations are further elaborated in the following 
examples. 


5. Illustrative Examples
Example 1 — In a general SIMD computer all of its
processing elements (PEs) execute the same instructions,
which are sent from the control unit (CU). These PEs
may exchange their data with each other through a
communication network. The ADG model for SIMD computer
systems is shown in Figure 6. From this model we
can determine the nature of parallel computation in an
SIMD system.

The communication geometry (CG) is directly
reflected by the hardware connections. It is easy to
see that in this case the CG is regular and is of the
switching network type. The data movement is bounded
by the CG and hence is also regular. In fact, the data
movement may be either variant or invariant depending on
the application. This variation of data movement can be
seen in the OPR of the Inter-PE O-node. The module
granularity (MG) depends entirely on the complexity of
the O-node. For the system as a whole, the MG is
nonuniform because of the existence of the CU and Inter-PE. But
if we only consider the PEs, they have uniform MG.

In Figure 6, the complexity of each O-node is not
specified since the associated operations vary across
applications. In general the operations of the CU
and Inter-PE are far more complicated than those of the
PEs. The CU not only controls the PEs and Inter-PE, but
also communicates with the external world. The Inter-PE
communication network takes commands from the CU
and exchanges information between PEs accordingly.
All the PEs execute the instructions which are broadcast
from the CU one at a time. Therefore the operation
of a PE is considered to be a straightforward single step.
Judging from the number of edges pointing toward each
O-node and the operations associated with it, we
conclude that, relatively speaking, the MG of the PEs is small
while the MGs of the CU and Inter-PE are large.

The memory space (MS) is closely related to the d-node 
and O-node. From the number of incoming edges, we 
can decide whether the memory space is local (no greater 
than two incoming edges) or global (more than two 
incoming edges). Based on this criterion, all the memory 
spaces (d-nodes) are local in this case. The size of MS 
can be roughly determined from the number of variables 
on the incoming edges. In Figure 6, edge 1 conveys 
instruction, data and CU’s control program, such as 
masking, routing, etc.; edge 2 transmits instruction and 
data; edge 3 transmits instruction, while edge 4 only 
transmits data. Therefore, a reasonable conclusion is 
that the CU memory is large and the PE memory is 
small. The concurrency control is centralized, as clearly 
expressed in Figure 6. The control overhead is not 
severe: all the PEs are under the same control 
instruction (AND case), only the Inter-PE requires 
additional control effort (OR case), and the ORD in the 
d-node of a PE is also simple. 

The summary of Example 1 and the analysis results 
for the MIMD system and the VLSI systolic array are 
listed in Figure 7. 


Example 2 - A parallel version of Earley's parsing 
algorithm [18] is reproduced below. 


Algorithm A. 
for i = 1 to n do in parallel 
    t(i-1,i) = Y x* {a_i} 
for j = 2 to n do 
    for i = 0 to n-j do in parallel 
    begin 
        [Scanner:] 
            t(i,i+j) = t(i,i+j-1) x* {a_(i+j)} 
        [Completer:] 
            for k = 1 to j-1 do in parallel 
                t(i,i+j) = t(i,i+j) U t(i,i+k) x* t(i+k,i+j) 
    end 


This algorithm constructs an upper-triangular 
parsing matrix with each element denoted as t(i,j). 
Algorithm A is executed in a pipeline fashion, and if we 
define subtask P(i,j) as calculating 

t(i,j) = t(i,j-1) x* {a_j} 

and, for k = 1 to j-1, 

t(i,j) = t(i,j) U t(i,k) x* t(k,j), 
we can represent Algorithm A by the ADG model as 
shown in Figure 8. It is easy to see that the data move- 
ment, as shown by the edges, is very complicated. Note 
that many of the edges are OR case edges, which means 
that the results of those subtasks do not have to be sent 
to all the destinations at the same time. For instance, 
t(0,1) is needed by subtasks P(0,2), P(0,3), and P(0,4). 
These subtasks are activated one at each stage, which 
indicates that a single data path moving t(0,1) through 
these subtasks is capable of doing the job. However, a 
special arrangement has to be made for this simple con- 
nection in order to have a correct execution. This 
arrangement is described in [18]. A simplification of con- 
nections changes Figure 8 to Figure 9 which describes a 
regular connected network. In Figure 9 each subtask 
simply executes 

t(i,j) = t(i,j-1) x* {a_j} 

and t(i,j) = t(i,j) U t(i,k) x* t(k,j). 
As described in [18], this subtask can be implemented on 
special hardware and executed in a small constant 
time. The control of this special hardware is simple and 
local (distributed). From the analysis above and the 
characteristics of different systems shown in Figure 7 we 
conclude that Algorithm A is suitable for VLSI systolic 
array implementation. 
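
The fan-out of a matrix element, such as the three consumers of t(0,1) noted above, can be enumerated with a short sketch. This is only an illustration under an assumption not stated in the text: the completer index k is taken to range strictly between i and j, so that P(i,j) reads t(i,k) and t(k,j) for i < k < j. The function name `consumers` is hypothetical.

```python
def consumers(i, j, n):
    """Subtasks P(a, b) that read t(i, j) for an input of length n.

    Assumes the completer of P(a, b) reads t(a, k) and t(k, b)
    for a < k < b, and the scanner of P(a, b) reads t(a, b-1).
    """
    readers = set()
    # t(i, j) used as the left operand t(a, k): a = i, k = j, b = j+1..n.
    # The scanner of P(i, j+1) also reads t(i, j), which is already covered.
    for b in range(j + 1, n + 1):
        readers.add((i, b))
    # t(i, j) used as the right operand t(k, b): k = i, b = j, a = 0..i-1.
    for a in range(i):
        readers.add((a, j))
    return sorted(readers)


# For n = 4 this reproduces the consumers of t(0,1) named in the text.
print(consumers(0, 1, 4))  # [(0, 2), (0, 3), (0, 4)]
```

Because the consumers of t(0,1) lie along a single row, one data path that passes the value from stage to stage can serve all of them, which is the simplification that leads from Figure 8 to Figure 9.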


6. Concluding Remarks 


In this paper, we describe the attributed directed 
graph model. This model is more powerful in modeling 
concurrent hardware and/or software systems. Further- 
more, this ADG model is a deterministic graph model 
which includes the control information in its attributes. 
We also study the features of parallel computation and 
express them in terms of the ADG model. Finally we 
use two examples to demonstrate the descriptive power 
of the ADG model and how this model can aid a 
designer to choose the proper algorithm for the proper 
architecture or vice versa. 
Figure 2. The resulting Petri net from Figure 1. 

Figure 3. Petri net with control nodes. 

Figure 4. Petri net with AND/OR notation. 

Figure 5. The relation between ADG and the attributes of parallel computation. 
Abstract -- Proper functioning of any parallel 
system depends on balance in the flow of informa- 
tion. A specific flow graph model for systems is 
presented using linear inequalities to characterize 
the terminal behavior of individual components of 
the system. These inequalities are combined into 
a homogeneous system of linear equations, whose 
solution reveals some of the global information 
flow properties of the parallel system. Several 
theorems are stated regarding characteristics of 
this global information flow in deadlock-free 
systems. 


Introduction 


There is a new interdisciplinary field called 
the Science of Creative Intelligence, which studies 
universal principles of orderliness and intelligence 
in human beings and in natural systems (1). Accor- 
ding to Maharishi Mahesh Yogi, the founder of the 
Science of Creative Intelligence, any system which 
expresses intelligence must have a coherent rela- 
tionship between the individual parts of the system 
(2). Maharishi's theory states that the level of 
intelligence depends on the degree of coherence and 
integrated functioning within the system (3). In 
the field of computer science, the individual com- 
ponents of any information processing system must 
function in a manner which produces a global coher- 
ence and balance throughout the entire system. This 
is especially true of systems with a high degree of 
parallelism, in which there are typically a large 
number of individual components, each having a high 
degree of independence. 


Flow Graphs 


For the sake of brevity and readability, this 
paper is rather informal in presentation. The ap- 
proach used in this research is similar in style to 
much of the work on properties of parallel control 
structures such as Petri-Net theory (4) and data 
flow schemas (5,6). For a more complete and formal 
analysis of the theory, the reader is referred to 
Lester (7). An information flow graph is defined 
as a directed graph, each of whose nodes contains 
an information processing module. The modules are 
of two types: Fixed modules and Union modules. 
Information packets flow along the arcs between 
modules, and each module has an internal state 
which determines its terminal behavior with respect 
to input and output of information packets. 


Our focus in this paper is not so much on the 
detailed structure or data values contained in each 
information packet, but on the total number of 
packets sent and received via a given arc. For 
that reason, the packets are treated as indistinguishable, 
unitary entities. The total number of 
packets which have been sent along a given arc b of 
the flow graph is called the count on that arc and 
is denoted |b|. The arcs have no capacity for internal 
storage and serve only as channels for pack- 
ets to flow between modules. The modules operate 
independently and interact with neighboring modules 
by sending or receiving packets along the connec- 
ting arcs (or terminals). Intuitively, a fixed 
module is one which maintains a fixed ratio for 
the counts on its connecting arcs. That is, if we 
look at the counts on the terminals of a fixed mo- 
dule over time, the ratios of the counts on differ- 
ent terminals will converge to a constant. More 
formally, for each pair of terminals (a,b) of a 
fixed module, there exist positive rational constants 
C, K, R such that 

$$-C < |b| - R\,|a| < K$$ 

There are many different types of typical system 
components which can be modelled as fixed modules. 
Some examples are shown in Fig. 1. 
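
As a small illustration of the fixed-ratio property, the following sketch simulates a hypothetical fixed module that sends two packets on terminal b for every packet it receives on terminal a, and records the deviation |b| - R|a| after every step. It assumes the defining condition has the form -C < |b| - R|a| < K; the module behavior and the function name are invented for the example.

```python
def run_fixed_module(steps, ratio=2):
    """Simulate a hypothetical fixed module and return |b| - ratio*|a| per step.

    The module alternates between receiving one packet on terminal a
    and sending `ratio` packets, one per step, on terminal b.
    """
    count_a = 0   # count on input terminal a
    count_b = 0   # count on output terminal b
    pending = 0   # packets still owed on b for the last input
    deviations = []
    for _ in range(steps):
        if pending == 0:
            count_a += 1          # receive one packet on a
            pending = ratio
        else:
            count_b += 1          # send one packet on b
            pending -= 1
        deviations.append(count_b - ratio * count_a)
    return deviations


dev = run_fixed_module(300)
print(min(dev), max(dev))  # -2 0
```

The deviation never leaves the interval from -2 to 0, so the condition holds for this module with, for example, C = 3 and K = 1, however long the module runs.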


Fixed modules are defined above as having a 
fixed ratio for the counts at every pair of termi- 
nals. The other type of module is the union module 
which has a sum property with respect to its ter- 
minals: the sum of the input counts is equal to 
the sum of the output counts. More formally, for 
any union module with input terminal set A and out- 
put terminal set B, there are positive integers C 
and K such that 

$$-C < \sum_{b \in B} |b| - \sum_{a \in A} |a| < K$$ 

Intuitively, a union module is a storage facility for 
information packets with some finite maximum capacity. 
Some examples of typical components of parallel 
systems which can be modelled as union modules 
are shown in Fig. 2. An arc in a flow graph 
is defined as dead if no further information 
packets can flow through it. A graph is deadlock- 
free if there is no reachable state with a dead arc. 


Current Law Equations 


Union modules and fixed modules together have 

a broad range of modelling power, as illustrated 
by the examples. Now we will present a simple but 
useful mathematical technique for analyzing the in- 
formation flow properties of any flow graph consis- 
ting of fixed modules and union modules. With each 
arc b of the flow graph, let us associate a current 
$i_b$. Now define the following current laws 
for the modules: 
1. Union modules - sum of input currents = sum of output currents. 
2. Fixed modules - for each pair of terminals (a,b) with fixed ratio R, $i_b = R\,i_a$. 
Intuitively, we may think of the currents as a 
measure of the information flow along the arcs. 
Current law 1 for union modules is just the famil- 
iar Kirchhoff's Current Law of electrical network 
theory. Knuth (8) and Deo (9) have noted the usefulness 
of KCL for analyzing the properties of 
normal sequential flow charts. However, current 
law 2 is a new law that is necessary for more complex 
parallel systems. For any information flow 
graph with n terminals, the current laws define a 
homogeneous system of linear equations $Ai = 0$, 
where $i = (i_1, i_2, \ldots, i_n)$ and A is an m by n matrix. 


Theorem 1 - For any deadlock-free information flow 
graph, the current law equations have a positive 
integral solution. 


The proof method for Theorem 1 is to notice that 
the current laws for fixed and union modules are 
taken directly from the linear inequalities that 
define the modules. From these linear inequalities, 
we know $-C < Ax < K$, where x is the vector of 
counts at the terminals. Since the flow graph is 
deadlock-free, the counts can grow arbitrarily 
large, so A must be noninvertible and $Ai = 0$ has a 
nontrivial solution. For an example of a simple flow chart 
computation whose current law equations have no 
positive solution, see Fig. 3. From Theorem 1, we know the 
flowchart must contain a deadlock, which is clearly 
seen by inspection. 


Independent Currents 


In the case that the current law equations do 
have a solution, a great deal may be learned about 
the global information flow in the graph from the 
properties of the solution. From standard techniques 
of linear algebra, the homogeneous system 
of linear equations $Ai = 0$ has a general solution 
$i = B(i_1, i_2, \ldots, i_k)$, where $k = n - \mathrm{rank}\,A$ and 
$(i_1, i_2, \ldots, i_k)$ are arbitrary constants, which we call 
the independent currents. All of the terminal currents 
of the flow graph may then be expressed as a 
linear combination of these k independent currents. 
For example, the computation of Fig. 4 involves 
communication between processes using mailboxes and 
Send-Receive primitives. There are 5 union modules 
and 5 fixed modules. The current law equations 
result in a solution with two independent currents 
$i_1$ and $i_2$ as shown. 
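
The number of independent currents, k = n - rank A, can be computed by ordinary Gaussian elimination. The sketch below is illustrative only: the four-terminal graph fragment is hypothetical (it is not the graph of Fig. 4), and the helper name `rank` is an assumption. Exact rational arithmetic is used so that no rounding affects the rank.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix (list of rows) by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0  # number of pivot rows found so far
    for col in range(len(m[0]) if m else 0):
        pivot = next((k for k in range(r, len(m)) if m[k][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for k in range(len(m)):
            if k != r and m[k][col] != 0:
                f = m[k][col] / m[r][col]
                m[k] = [x - f * y for x, y in zip(m[k], m[r])]
        r += 1
    return r


# Hypothetical fragment with terminals (s, a, b, c):
#   fixed module, input s, outputs a and b:  i_a = i_s  and  i_b = 2 i_s
#   union module, inputs a and b, output c:  i_a + i_b = i_c
A = [
    [-1, 1, 0, 0],
    [-2, 0, 1, 0],
    [0, 1, 1, -1],
]
n = len(A[0])
k = n - rank(A)
print(k)  # 1
```

With one independent current, every terminal current in this fragment is a fixed multiple of $i_s$; for instance (1, 1, 2, 3) is a positive integral solution, as Theorem 1 requires of a deadlock-free graph.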


From the independent current solutions for 
Fig. 4, it is seen that the T branch of each Deci- 
der has the same current. This means that in order 
for the whole computation to be deadlock-free, both 
Deciders must make the same percentage of T choices 
(subject to fixed deviations depending on the ini- 
tial contents and capacity of the mailboxes). If 
each Decider is assumed to be independently making 
arbitrary free choices, then a communications dead- 
lock may result. In order to make it easier to 
analyze properties of program flow graphs, it is 
common in the theory of program schemas to consi- 
der Deciders as free to make independent choices. 
A specific assignment of choices for each Decider 
is usually called an interpretation. 


Theorem 2 - If an information flow graph is 
deadlock-free (for all interpretations), then 
no. of independent currents > no. of Deciders. 


The proof of this theorem is too complex to be in- 
cluded here, but can be found in Lester (7). How- 
ever, the substance of the result is quite simple 
and intuitive. If a Decider is free to make 
independent choices, then there must be one degree 
of freedom in the overall flow of control to accommodate 
that Decider. The number of independent currents 
in the solution to the current law equations 
represents the total number of degrees of freedom 
in the global flow of information. If the system 
is to be deadlock-free, these degrees of freedom 
must exceed the number of Deciders. In Fig. 4, 
there are only two independent currents ($i_1$ and $i_2$) 
for two Deciders; thus, by Theorem 2, the overall 
computation will have a deadlock for some interpretation 
of the Deciders. 


Conclusions 


In this paper, we have presented a simple 
mathematical technique for analyzing the global 
information flow properties of parallel systems. 
The technique relies on the standard procedures of 
linear algebra for solving a homogeneous system of 
linear equations. Thus, this analysis procedure 
could easily be automated into a compiler for parallel 
programs or into any parallel system design 
tool. If a parallel system does not meet the con- 
ditions specified by Theorems 1 and 2, then it has 
a potential deadlock. 
Abstract 


Virtual time is a broad, new paradigm for 
organizing and synchronizing distributed systems, 
subsuming such heretofore distantly related pro- 
blems as distributed discrete event simulation and 
distributed database concurrency control. It is 
an abstraction of real time in much the same way 
that virtual memory is an abstraction of real 
memory, and it reorganizes the concepts of con- 
currency and synchronization in a manner similar 
to the way virtual memory reorganized the subject 
of memory management. Virtual time systems can 
be implemented using the Time Warp mechanism, a 
distributed synchronization mechanism that is distinguished 
by its wholesale commitment to lookahead-rollback 
as its primary synchronization tool, 
by its implementation of rollback through antimessages, 
and by its global coordination through 
the concept of global virtual time. 


1. Introduction 


This paper is a short introduction to a new 
theoretical paradigm for distributed computation 
called virtual time, and to its implementation, 
the Time Warp mechanism. The virtual time paradigm 
is a method of coordinating distributed 
systems by imposing on them a temporal coordinate 
system more computationally meaningful than real 
time, and defining all notions of synchronization 
and timing in terms of it. The virtual time 
scale need not be closely related to real time, 
but it is still "temporal" because virtual time 
increases as the computation progresses, because 
events at the same virtual time act as parts of a 
single action atomic with respect to actions at 
other virtual times, and because one can reason 
correctly about virtual time relations such as 
"before" and "after" by using ordinary "Newtonian" 
intuition. The more difficult "relativistic" 
reasoning [Lamport 78] required to understand the 
real time relations "before" and "after" in distributed 
systems is unnecessary. 


The Time Warp mechanism is a distributed pro- 
cess control regime that implements virtual time, 
in the same way that paging or segmentation 
mechanisms implement virtual memory. Its distin- 
guishing feature is its wholesale commitment to 
process lookahead and rollback as the fundamental 
synchronization mechanism. Although relying on 
rollback may be an unorthodox step, it is also liberating. 
Many synchronization mechanism issues become 
much simpler once the possibility of rollback 
is considered. Of course at first glance 
rollback seems more expensive both in space and 
time than other synchronization mechanisms such 
as process blocking, so it is essential to 
argue for each use either that rollback will be 
rare or that any other synchronization mechanism 
would incur similar overhead anyway. Such arguments 
can be made on the basis of temporal locality 
assumptions about the dynamic behavior of 
programs, analogous to the spatial locality 
assumptions underlying paging systems, but these 
assumptions have yet to be tested. 


These two notions, the virtual time paradigm 
and the Time Warp mechanism, offer new ways to 
think about distributed computation. In particu- 
lar the following results derive from them, 
although the justifications can only be outlined 
in this paper. 


-Discrete event simulation can be viewed as 
an application of the virtual time paradigm. 
The Time Warp mechanism provides a new 
method for high-concurrency discrete event 
simulation that is transparent to the programmer 
and free of deadlock and starvation 
[Jefferson 82], [Jefferson 83a]. 


-Distributed database systems can also be 
viewed as virtual time systems, in which 
case the Time Warp mechanism is a con- 
currency control and crash recovery mecha- 
nism, again yielding high concurrency with- 
out starvation or deadlock [Jefferson 83b]. 


-The Time Warp mechanism suggests a rethink- 
ing of synchronization in distributed 
systems. Most synchronization is based 
on the ability to block a process and 
restart it later, sometimes augmented 
with the ability to abort an action in 
progress. But the Time Warp mechanism 
seems to offer the first wholesale use 
of process rollback as the basis of 
synchronization, making new protocols 
and strategies possible. 


In the next section we will describe the concept 
of virtual time and the semantics of virtual 
time systems. In Section 3 we will compare our 
views to those of other theorists in the field of 
distributed systems. In Section 4 we will describe 
the Time Warp mechanism, both its local and its 
global parts. In Section 5 we will give three 
examples of paradigms that become unified when 
viewed as virtual time systems. Section 6 gives 
the extended comparison between virtual time and 
virtual memory that has been a central focus of 
this research. Finally, Section 7 offers some 
future directions. 


2. Virtual time 


A virtual time system is a distributed system 
that executes in coordination with an imaginary 
global virtual clock that ticks virtual time. 
Virtual time is a temporal coordinate system used 
to measure computational progress and define synchronization. 
Viewed abstractly, a virtual clock 
always progresses forward (or at least never back- 
ward) at a pace that may either be closely bound 
to real time or completely independent of it. We 
assume that virtual times are real values (and 
∞), totally ordered as usual by the relation <. 


We envision systems of hundreds or thousands 
of processes all executing concurrently on a net- 
work of many processors. It is useful to con- 
sider each process as occupying a "point" in 
virtual space, and its unique name as its spatial 
coordinate. Every primitive action executed by a 
system can thus be assigned both a virtual time 
and a virtual space coordinate, and the set of 
all actions that take place at the same virtual 
place x and virtual time t are collectively referred 
to as the event at (x,t). 


Processes communicate by exchanging messages 
stamped with the name of the sender, the virtual 
send time, the receiver, and the virtual receive 
time. The virtual send time is the virtual time 
at the moment the message is sent; and likewise 
the virtual receive time is the virtual time when 
the message must be received. We can also say, 
equivalently, that a message is stamped with the 
coordinates of both the sending and receiving 
events. A message is simply the transfer of 
information from one "point" in virtual space- 
time to another (like a photon in physics). 


The interaction of processes, messages and 
virtual time is subject to two semantic rules: 


Rule 1: The virtual send time of a 
message must be less than or equal to 
its virtual receive time. 


Rule 2: All messages directed to a 
particular process must be processed 
in nondecreasing virtual receive time 
order. 


These restrictions, similar to Lamport's Clock 
Conditions [Lamport 78], embody our desire that 
the arrow of causality be pointed in the direction 
of increasing virtual time. For convenience we 
adopt two further rules in this paper: 


Rule 3: No two messages directed to the same 
process have the same virtual receive time. 


Rule 4: Events not involving the 
receipt of a message are null, 
i.e., no-ops. 


Rule 3 has the effect of changing the word "nondecreasing" 
to "increasing" in Rule 2. Rule 4 
removes from consideration spontaneous state 
changes and message sending that are not prompted 
by receipt of a message. The system must therefore 
be driven (or at least initiated) by message(s) 
from an "outside" process. Many interesting 
issues arise when these latter two rules are 
relaxed, as is usually necessary, but they would 
unduly complicate our discussion. 
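
Rules 1 to 3 are simple enough to check mechanically. The following sketch is illustrative only; the `Message` class, its field names, and the two helper functions are assumptions introduced for the example and are not part of the Time Warp mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_time: float      # virtual send time
    receiver: str
    receive_time: float   # virtual receive time


def obeys_rule_1(msg):
    """Rule 1: virtual send time <= virtual receive time."""
    return msg.send_time <= msg.receive_time


def processing_order(messages, receiver):
    """Order in which `receiver` must process its messages (Rules 2 and 3)."""
    inbox = sorted(
        (m for m in messages if m.receiver == receiver),
        key=lambda m: m.receive_time,
    )
    times = [m.receive_time for m in inbox]
    if len(set(times)) != len(times):
        raise ValueError("Rule 3 violated: duplicate virtual receive time")
    return inbox


msgs = [
    Message("x", 1.0, "y", 5.0),
    Message("z", 2.0, "y", 3.0),
]
print([m.sender for m in processing_order(msgs, "y")])  # ['z', 'x']
```

Note that the message from z is processed first even though it was sent later in virtual time: only the virtual receive times determine the order at the receiver.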


A non-null event at (x,t) consists of the 
following three actions executed sequentially: 


1. Process x receives the message stamped 
with receiver x and virtual receive 
time t. 


2. It updates its state accordingly. 


3. It sends zero or more messages stamped 
with sender x and virtual send time 
t. 
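
The three actions of a non-null event can be sketched as a single method. This is a minimal illustration under stated assumptions: the `Process` class, the running-sum state, the fixed destination `"y"`, and the `delay` parameter used to pick the virtual receive time are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    name: str
    state: int = 0
    outbox: list = field(default_factory=list)

    def event(self, t, payload, delay=1.0):
        """Execute the non-null event at (self.name, t)."""
        # 1. Receive the message stamped with receiver self.name and
        #    virtual receive time t (its content is `payload`).
        # 2. Update the state accordingly (here, a running sum).
        self.state += payload
        # 3. Send messages stamped with sender self.name and virtual
        #    send time t; the receive time t + delay respects Rule 1
        #    as long as delay >= 0.
        msg = {
            "sender": self.name,
            "send_time": t,
            "receiver": "y",
            "receive_time": t + delay,
            "payload": self.state,
        }
        self.outbox.append(msg)
        return msg


p = Process("x")
m = p.event(5.0, 3)
print(m["send_time"], m["receive_time"], p.state)  # 5.0 6.0 3
```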


The semantics of virtual time are extremely 
simple. 


If an event A has a virtual time less than 
that of event B, then the execution of A and 
B must be scheduled so that A appears to be 
completed before B starts. 


Even though A is earlier in virtual time than B, 
an implementation need not actually perform A 
before B. It may achieve better performance 
by scheduling A concurrently with B or even after 
it, as long as this fact is not detectable by any 
tests within the virtual time system. 


A consequence of this semantic rule is that if 
A and B have exactly the same virtual time (even 
if they occur at different places) they appear to 
be components of a single atomic operation that 
is indivisible with respect to events at other 
virtual times, because all events at earlier 
virtual times must appear to have been completed 
before either A or B starts, and all events at 
later virtual times must appear not to have start- 
ed until A and B are complete. Note that if A 
and B do have the same virtual time coordinate 
there are no restrictions on their scheduling. 


There are several degrees of freedom in the 
design of virtual time systems. The virtual time 
scale may be discrete or continuous (although here 
we assume continuous). It may be partially or 
totally ordered (we assume totally). It may be 
derived from real time or be independent of it (we 
will give examples of both). Virtual times may be 
visible to programmers to be manipulated as values, 
or they may be hidden from them and manipulated 
implicitly (exactly as with their spatial counter- 
parts, virtual addresses.) The virtual times 
associated with events may be explicitly calculated 
by user programs or assigned by fixed default rules 
(again, we give examples of both). And, there are 
many conventions that may be established to relax 
the restrictions of Rules 3 and 4. Each set of 
choices defines a different kind of virtual time 
system, but all are similar enough that a unified 
approach to the theory and implementation is appro- 
priate. 


3. Comparison to other work involving artificial 
time scales 


Recently there have been a number of proposals 
published for synchronizing distributed systems 
using artificial time scales. We now briefly contrast 
three of them with the virtual time paradigm 
and the Time Warp mechanism. 


3.1 Lamport's work 


Lamport [Lamport 78] seems to have been among 
the first to recognize that our understanding of 
real-time temporal order, simultaneity and causal 
relations between events in a distributed system 
bears a strong resemblance to our understanding 
of the same concepts in special relativity. In 
particular he showed that the temporal relation- 
ships that are operationally definable within a 
distributed system form only a partial order in- 
stead of a total order, and that "concurrent" 
events are incomparable under that partial order. 
He further showed that it is always possible 
effectively to extend this partial order to a 
total order by defining a system of artificial 
clocks, one clock for each process, that label 
each event with a value from a totally ordered 
set in a manner consistent with the partial order. 
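
One common formulation of such artificial clocks can be sketched as follows. The class and method names are illustrative assumptions; ties between equal clock values at different processes are typically broken by process name in order to obtain a total order.

```python
class LamportClock:
    """A per-process logical clock in the style of [Lamport 78]."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self):
        """Advance the clock and return the timestamp to attach to a message."""
        return self.tick()

    def receive(self, msg_time):
        """Advance past both the local clock and the message timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
ts = p.send()            # p stamps the message with 1
print(ts, q.receive(ts))  # 1 2
```

The receive rule guarantees that the receipt of a message is labelled with a later clock value than its sending, which is what makes the labelling consistent with the partial order.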


We can describe Lamport's work as starting 
from a particular execution of a distributed system 
and ending with an assignment of totally ordered 
clock values to the events of that execution. 
With virtual time we are doing the exact reverse. 
We assume that every event is labelled with a 
clock value from a totally ordered set (the vir- 
tual time scale) in a manner obeying Lamport's 
Clock Conditions. The problem addressed by the 
Time Warp mechanism is to find an execution that 
is both consistent with this labelling and that 
exhibits high concurrency. 


3.2 Reed's work 


In his study of distributed systems Reed 
invented the notion of pseudotime, which bears 
a strong superficial resemblance to virtual time 
but which has a very different motivation and 
implementation. Reed is primarily interested in 
implementing distributed atomic actions, and his 
work has been used as the basis for the design of 
multiversion timestamp order mechanisms for transaction 
concurrency control in distributed databases. 
Virtual time, by contrast, has as its 
goal the creation of a temporal coordinate system 
in which distributed computation is embedded. 
It was inspired by analogies to physical space 
and time and to virtual memory. Atomicity is not 
a goal per se, but it is a synchronization effect 
that can be arranged trivially within virtual time. 


Two actions occurring at different pseudotimes 
may be committed in either order. Reed's 
mechanism "attempts" to manage execution so that 
the action with the earlier pseudotime will be 
committed earlier, but this is only a heuristic. 
With bad luck or timing the action with the earlier 
pseudotime may have to be aborted and retried 
later with a later pseudotime after the 
other action is complete. But with virtual time 
there is no abortion and no retry with a new timestamp; 
synchronization is done by rollback and 
actions must be executed in virtual time order. 
One cannot specify in which order two atomic 
actions with different pseudotimes are to be 
executed; but one is forced to specify that order 
for different virtual times. One last difference 
is worth noting. Because Reed's mechanism uses 
abortion and retry for synchronization, and there 
is no limit to the number of times an action may 
have to be retried, starvation is a potentially 
serious problem. There is no corresponding hazard 
with virtual time and the Time Warp mechanism. 


3.3 Schneider's work 


Schneider has done a more general study of 
synchronization [Schneider 82], in which he pre- 
sents a general mechanism for implementing essen- 
tially any synchronization protocol in a distri- 
buted environment. His technique is based on 
broadcasting all synchronization-related messages 
("phase-transition messages") to every process in 
the system, with every process in turn broadcasting 
its acknowledgement to all other processes, 
thus making them all aware of every synchronization 
in the system. All such messages and acknowledgements 
are timestamped with values from a valid 
clock system such as Lamport's so that all pro- 
cesses agree both on what synchronization messages 
have been broadcast, and on the logical order in 
which they were broadcast. Broadcasting all 
synchronization messages and acknowledgements is 
apparently logically equivalent to keeping syn- 
chronization information in a globally accessible 
shared memory. 


Under the assumption of reliable, order- 
preserving message delivery Schneider also shows 
that each process may make synchronization decisions 
(such as whether to proceed or stay blocked) 
locally, based on the set of "fully acknowledged" 
messages it has received. A message is considered 
fully acknowledged at process P if P has received 
it as well as copies of the acknowledgements to 
it from every other process in the system. The 
importance of recognizing when a message m is 
fully acknowledged is that the receiver is then 
guaranteed that it will never again receive a 
message or acknowledgement with a timestamp ear- 
lier than that of m. 
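
The bookkeeping behind the "fully acknowledged" test can be sketched as follows. This is only an illustration of the definition given above, not Schneider's implementation: the `AckTracker` class and its method names are hypothetical, and "every other process" is read here as every process other than P itself.

```python
class AckTracker:
    """Tracks, at one process, which broadcast messages are fully acknowledged."""

    def __init__(self, me, processes):
        self.others = set(processes) - {me}   # processes whose acks are needed
        self.received = set()                 # ids of messages received here
        self.acks = {}                        # message id -> set of acking processes

    def on_message(self, msg_id):
        self.received.add(msg_id)

    def on_ack(self, msg_id, from_process):
        self.acks.setdefault(msg_id, set()).add(from_process)

    def fully_acknowledged(self, msg_id):
        """True once the message and an ack from every other process have arrived."""
        return (
            msg_id in self.received
            and self.acks.get(msg_id, set()) >= self.others
        )


tracker = AckTracker("P", ["P", "Q", "R"])
tracker.on_message("m")
tracker.on_ack("m", "Q")
print(tracker.fully_acknowledged("m"))  # False
tracker.on_ack("m", "R")
print(tracker.fully_acknowledged("m"))  # True
```

The set of acknowledgements grows with the number of processes for every message, which is the source of the scaling concern raised about broadcast-based mechanisms.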


Schneider's mechanism can be compared to 
virtual time in that it does assign temporal 
coordinates to some of the actions in a distri- 
buted system, namely the synchronization actions. 
But where the Time Warp mechanism is extremely 
"liberal", making synchronization decisions on a 
provisional basis and rolling back when they turn 
out to be wrong, Schneider's mechanism is extremely 
"conservative", waiting to make such decisions 
until such time as it can be proved that they cannot 
be wrong.¹ One disadvantage of Schneider's 
mechanism is that it seems to be limited to systems 
with only a few processes; it does not scale up- 
ward smoothly to thousands of processes because 
of the prohibitive amount of message and acknowl- 
edgement processing inherent in mechanisms rely- 
ing on broadcast. The Time Warp mechanism seems 
to have no such barrier to indefinite scale-up. 


4. The Time Warp mechanism 


The Time Warp mechanism is defined without 
reference to any underlying computer architecture 
and can run efficiently on many multiple processor 
systems, from a tightly coupled multiprocessor 
such as C.mmp [Wulf 81] or Cm* [Swan 77], to a local 
area network connected by Ethernet [Metcalfe 76]. 


¹ This observation is due to J.C. Brown. 


We assume that message communication is reliable, 
but we do not assume that messages are delivered 
in the order sent; in fact such a protocol would 
be wasteful because messages are not generally 
processed exactly in sending order. 


For correct implementation of virtual time 
in accordance with Rules 1-4, it is necessary and 
sufficient that at each process messages are ac- 
cepted in virtual receive time order and events 
are executed in virtual time order. It is unnecessary, 
and generally undesirable, for an implementation 
to require that all processes progress 
through virtual time at the same rate with respect 
to real time, or to require that at each 
instant of real time all processes must be ex- 
ecuting events of the same virtual time. In 
general, some processes may be ahead in virtual 
time and others may lag behind. It is not obvious 
how this criterion can be met in an efficient im- 
plementation because messages will not generally 
arrive at their destinations in virtual receive 
time order. Furthermore it is impossible for a 
process, on the basis of local information alone, 
to block and wait for the message with the "next" 
virtual receive time because, since we assume 
virtual times to be real numbers, no matter which 
one is presumed to be next it is always possible 
that another message with an earlier virtual 
receive time will arrive later. This is the 
central implementation problem that the Time Warp 
mechanism addresses. 


The Time Warp mechanism has two major parts, 
the local control mechanism, concerned with mak- 
ing sure that events are executed and messages 
received in correct order (providing a "weakly
correct" implementation of virtual time), and the
global control mechanism, concerned with global
issues such as space management, flow control,
I/O, error handling and termination detection
(contributing to its "strong correctness"). We
discuss these in turn.


4.1. The local control mechanism


Under the Time Warp mechanism the world is
viewed as a collection of processes that communi- 
cate with one another via messages. Although 
abstractly there is a single global standard of 
virtual time, there is no global virtual clock 
variable in the implementation that is accessible 
to processes; instead each process has its own 
local virtual clock variable that it may read.
At any moment some local virtual clocks will be
ahead of others, but this fact is invisible to
the processes themselves because they can read
only their own virtual clock. The virtual send
time of a message is always copied from the
sender's virtual clock. The virtual receive
time may be assigned by any one of a variety of
conventions. All interactions between processes
are by message, including such things
as input/output and process creation.


Because it is impossible to wait for the
"next" message, each process executes continuously,
processing those messages that have already
arrived in increasing virtual receive time order
as long as it has any messages left. All of its
execution is provisional, however, as it is
constantly gambling that no message will ever arrive
with a virtual time stamp less than the one stamped
on the message it is now processing. As long
as it wins this bet execution proceeds smoothly.
The novelty of the Time Warp mechanism is that
whenever the bet is lost the process must pay by
rolling back. We might describe the situation
differently by saying that each process is
constantly doing a "lookahead" in its input message
queue, and in any kind of lookahead scheme there
are certain comparatively infrequent contingencies
that require the undoing of some work already
accomplished. Lookahead is successful, however,
when the overhead of occasional undoing is
outweighed by increased performance the rest of
the time.²


In most practical situations there will be 
more processes than processors, so only a subset 
of the processes can run at any one moment. The 
natural scheduling rule is to always run those 
processes whose local virtual clocks are farthest
behind. On a multiprocessor this means always
running the n farthest behind processes, where n
is the number of processors available. On a network
it means always running, on each processor,
the farthest behind process residing on that processor.
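
The scheduling rule above can be illustrated with a small sketch. The function name `pick_runnable` and the dictionary representation of the local virtual clocks are assumptions made for this illustration; they are not part of the Time Warp mechanism as described here.

```python
import heapq

def pick_runnable(clocks, n):
    """Return the names of the n processes whose local virtual clocks
    are farthest behind (hypothetical helper, for illustration only).

    clocks: dict mapping process name -> local virtual clock value.
    """
    # nsmallest orders by clock value first; ties fall back to the name.
    pairs = ((t, name) for name, t in clocks.items())
    return [name for _, name in heapq.nsmallest(n, pairs)]
```

On a network, the same function would be applied separately on each processor with n = 1, restricted to the processes residing there.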


A process has a single input queue in which 
all arriving messages are stored in order of in- 
creasing virtual receive time. Ideally the ex- 
ecution of a process is simply a cycle in which it 
executes its events in increasing virtual time 
order. An event consists of the following three
actions executed sequentially.


1. A process receives the message from the
input queue whose virtual receive time
is the same as the time in its local
virtual clock.


2. It performs the actions appropriate in
response to the message,


3. It updates its local virtual clock to
the time of the next message in the input
queue (or to +∞ if there is none),


4. and repeats (or "terminates" if the
virtual clock reads +∞).


Whenever there is no next message a process
terminates, but its state is not destroyed because
it may later roll back and unterminate. This
ideal scenario applies as long as no message ever
arrives so "late" that the receiver's virtual
clock has a value greater than the virtual receive
time stamped on the message, i.e. arrives with a
virtual receive time in the "past". But this is
bound to happen occasionally for any of several
reasons, e.g. because the sender's virtual clock
has not advanced as far as the receiver's or
because of transmission delays in the network
transport mechanism. While the incidence of late
messages may be low (and can be made as low as
desired by introducing artificial delays), it cannot
be reduced to zero because it is fundamentally
dependent on the vagaries of computation speeds and
transmission delays.


² I am indebted to Tim Standish at Irvine for
this characterization.


Whatever the reasons for the late arrival of 
a message, the semantics of virtual time demands 
that messages be received by each process strictly 
in virtual receive time order, and the only way 
to accomplish this is for the receiver to roll 
back to an earlier virtual time, cancelling all 
intermediate side-effects, and begin to execute 
forward again, this time receiving the late 
message in its proper sequence. 
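
As a minimal sketch of how a late arrival might be detected, the following Python function inserts an arriving message into an input queue kept in virtual receive time order and reports whether the receiver must roll back. The name `deliver` and the tuple layout `(virtual_receive_time, text)` are assumptions for illustration. The strict comparison follows the description of a "late" message given above.

```python
import bisect

def deliver(input_queue, local_clock, message):
    """Insert `message` into `input_queue`, which is kept sorted by
    virtual receive time, and report whether it arrived late.

    input_queue: list of (virtual_receive_time, text) tuples, increasing.
    Returns True if the receiver must roll back to receive the message
    in its proper sequence.
    """
    bisect.insort(input_queue, message)
    recv_time = message[0]
    # Late: the local virtual clock has already passed the receive time.
    return recv_time < local_clock
```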


Rollback in a distributed environment is 
complicated by the fact that the process in ques- 
tion may have sent any number of messages to other 
processes, causing side effects in them and lead- 
ing them to send still more messages to still more 
processes, and so on. Some of those messages may 
have requested output or some other irreversible 
action (dispense money, launch missile). Some of 
them may be physically in transit and therefore 
out of the system's control for finite but arbi- 
trary durations. The paths followed by these 
direct and indirect messages from process to pro- 
cess may not form a tree, but may converge or 
even contain cycles, leading to worries about 
infinite loops or deadlock in any rollback mechan- 
ism. Nevertheless, all such messages, direct or 
indirect, in transit or not, causing output or 
not, must be effectively "unsent" and their indirect
side effects, if any, reversed. The Time
Warp rollback mechanism is able to accomplish all 
this quite efficiently, and without stopping any 
part of the system. 


4.2 Antimessages and the rollback mechanism 


The name "Time Warp" derives from the fact 
that the virtual clocks of different processes 
need not agree, and the fact that they go both 
forward and backward in time. Over a lengthy 
computation each process may roll back many times 
while generally progressing forward. The fact that 
virtual clocks are sometimes set back does not 
violate our intention that "abstractly a virtual 
clock always progresses forward (or at least never 
backward)" because rollback is completely trans- 
parent to the process being rolled back. Pro- 
grammers can write software without paying any 
attention to the possibility of messages arriving 
late, and even without any knowledge of the issue 
at all, just as they can write without any atten- 
tion to, or knowledge of, the possibility of page 
faults in a virtual memory system.


To understand the rollback mechanism we must 
understand the nature of processes, messages and 
antimessages. In Fig. 4.1 we see the structure 
of a process named A. The blank fields in the 
figures are fields whose values are irrelevant in 
this description. A process is composed of: 


1. a process name (virtual space coordinate), 
which must be unique in the system. 


2. a local virtual clock, which in the figure
reads 181, indicating that the message
with receive time 181 is being processed.


[Figure 4-1: Process A with input queue, output
queue and state queue. Fields shown for each message:
virtual receive time, receiver, virtual send time,
sender, sign, message text. The virtual clock
(virtual time coordinate) reads 181.]


[Figure 4-2: Process A rolls back to time 135
and sends antimessages.]


3. a state which in general is the entire
data space of the process, including
its execution stack, its own variables,
and its program counter. We show here
only a few own variables to represent
the whole state.


4. a state queue, containing saved copies
of the process's recent states. As we
shall see when we discuss the global
control mechanism, it is not necessary
to have states saved all the way back to
the beginning of virtual time, but there
must be at least one state saved that is
older than global virtual time (GVT).
In this figure GVT is assumed to be 100.


5. an input queue containing all recent
incoming messages sorted in order of
virtual receive time. Some of these
messages have already been received
because their virtual receive times are
less than 181. Nevertheless they are not
deleted from the queue because it may
be necessary to roll back and reprocess
them. Other messages with virtual receive
times greater than 181 have not
yet been received, or else they have been
received and "unreceived" an equal number
of times. Only incoming messages whose
virtual send time is greater than or equal
to GVT must be saved.


6. an output queue containing copies of the
messages it has recently sent, kept in
virtual send time order. They are needed
in case of rollback, for it will then be
necessary to know which messages have
been sent and must be "unsent". Only
messages whose virtual send time is
greater than or equal to GVT need be
saved.


Consider now the situation that arises if a 
message with virtual receive time 135 arrives, as 
in Fig. 4.2. It is apparent that all of the work 
that was done by this process since virtual time 
135 (in fact, since 121, the latest event earlier 
than 135) must be undone by rollback. The first 
step in the rollback mechanism is simply to search 
the state queue for the last state A was in before 
time 135 and restore it. We also restore 135 as 
the value in A's local virtual clock. After this 
we can discard from the state queue all states 
saved at or after time 135 and start A executing 
forward again. However, we still must correct 
for the fact that between time 135 and 181 A sent 
several messages to other processes that must now 
be "unsent". We accomplish this through an
extremely simple device: antimessages.


Every message has an antimessage that is ex- 
actly like it in format and content except in one 
field, called its sign. Two messages identical 
except for opposite signs are called antimessages 
of one another. All messages sent by user pro- 
grams have a positive (+) sign; their antimes- 
sages have a negative (-) sign and are only 
created during rollback. Negative messages are 
manipulated exactly as positive messages are. 
When a negative message is sent it is enqueued
in the output queue of the sender according to its
virtual send time, and a copy is delivered to the 
destination process and enqueued in its input 
queue according to its virtual receive time. A 
negative message causes a rollback at its destin- 
ation if its virtual time is less than or equal 
to the receiver's virtual clock time, just as a 
positive message would. 


What makes antimessages so useful is the 
peculiar queueing discipline they follow. Whenever
a message (positive or negative) and its
antimessage occur in the same queue, they immediately
annihilate one another. Thus, the result
of enqueueing a message in the output or input 
queue of a process can be to shorten it by one 
rather than lengthen it. It does not matter which 
message, negative or positive, arrives at the 
queue first; if and when the second one arrives 
the annihilation takes place. It is perhaps 
unnecessary to point out that the behavior of 
messages and antimessages is reminiscent of the 
behavior of particles and antiparticles in 
physics (except for the fortunate lack of gamma 
rays). 
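
The annihilating queue discipline can be sketched as follows. The `Message` class mirrors the fields shown in Figure 4-1, but the class itself, the field names, and the `enqueue` helper are assumptions made for illustration. For brevity the queue is an unsorted Python list; a real queue would also be kept in virtual time order.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Message:
    recv_time: float   # virtual receive time
    receiver: str
    send_time: float   # virtual send time
    sender: str
    sign: int          # +1 for a message, -1 for its antimessage
    text: str

    def anti(self):
        """Return the message identical to this one except for its sign."""
        return replace(self, sign=-self.sign)

def enqueue(queue, msg):
    """Add msg to queue, or annihilate it with its antimessage if present.

    Returns True if an annihilation took place (the queue got shorter).
    """
    partner = msg.anti()
    if partner in queue:
        queue.remove(partner)
        return True
    queue.append(msg)
    return False
```

Because `enqueue` only looks for the opposite-signed twin, it behaves the same whichever of the two messages arrives first.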


The rollback of process A is completed simply 
by sending antimessages for all of those messages 
that it sent between times 135 and 181. In each 
case a copy of the negative message is first
enqueued in the output queue of A, causing annihilation
of the positive copy that was there. The
result is that A now has no record that those
messages, positive or negative, ever existed, and
it is truly in the state it would have been if
the message with virtual receive time 135 had
arrived in its proper order. Antimessages are
also delivered to their destinations with the
following possible effects. (a) If the original
(positive) message has arrived but has not yet 
been processed, then the negative message, having 
the same virtual receive time, will also be in 
the virtual future of the destination and will 
not cause a rollback. It will, however, cause 
an annihilation, leaving the system with no record 
that either message ever existed, exactly as we 
would want if the message were to be truly "un- 
sent". (b) Another possibility is that the origi- 
nal positive message has a virtual receive time 
that is now in the past with respect to the re- 
ceiver's virtual clock, meaning that it and 
possibly others with later virtual receive times 
have already been processed, causing side effects 
on the receiver's state, and causing the sending 
of more messages to a third level of processes.
In this circumstance the negative message, having
the same virtual receive time as the positive
one, will also arrive in the receiver's virtual
past and will cause it to roll back to the virtual
time when the positive one was received.
It will also annihilate with the positive one,
so that when the receiver starts executing forward
again the situation will again be as though
neither message had ever existed. As part of this
secondary rollback more antimessages will be sent 
to the third level of processes, and the same 
actions we have been describing will proceed recursively.
(c) There is a third case as well,
namely that the negative message arrives at the
destination before the positive one. In this case
it is enqueued as usual, and will eventually be
annihilated when the positive message finally
arrives. If the negative message is actually received
by the destination process before it is
annihilated by the positive message, the receiver
may take any action at all, e.g. a no-op. Any
such action will eventually be rolled back when
the positive message arrives.


This recursive antimessage/rollback protocol 
is extremely robust, and works correctly under all 
possible circumstances. The levels of indirection 
may be to any depth, and there may even be cir- 
cularity in the graph of antimessage paths with 
no ill effects. The rollback process need not be 
atomic, and indeed many interacting rollbacks may 
be going on simultaneously with no special syn- 
chronization. There is no possibility of dead- 
lock, and the system does not have to be stopped. 
The worst case is that all processes in the system 
roll back to the same virtual time as the original 
one did, and then proceed forward again. 
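
The local part of a rollback (restore a saved state, reset the clock, and "unsend" later messages) can be sketched as below. This is only an illustration under stated assumptions: the dictionary layout of a process, the name `rollback`, and the convention that each saved state is the state reached after the event at its virtual time completed are all hypothetical choices, not details given in the text.

```python
def rollback(process, late_time):
    """Roll `process` back so that a message with virtual receive time
    `late_time` can be received in its proper sequence.

    process: dict with keys
        "clock"        - local virtual clock
        "state"        - current state
        "state_queue"  - list of (virtual_time, saved_state), increasing
        "output_queue" - list of (send_time, receiver, text) already sent
    Returns the antimessages to transmit as (send_time, receiver, text, -1).
    Assumes at least one saved state is older than late_time, which the
    GVT rule for state queues guarantees.
    """
    # Keep only states saved before late_time and restore the last one.
    earlier = [(t, s) for t, s in process["state_queue"] if t < late_time]
    process["state_queue"] = earlier
    process["state"] = earlier[-1][1]
    process["clock"] = late_time
    # Messages sent at or after late_time must be "unsent".
    antimessages = [(t, r, x, -1)
                    for t, r, x in process["output_queue"] if t >= late_time]
    # Enqueueing each antimessage locally annihilates the positive copy.
    process["output_queue"] = [m for m in process["output_queue"]
                               if m[0] < late_time]
    return antimessages
```

Each returned antimessage would then be delivered to its receiver, where it either annihilates quietly, triggers a secondary rollback, or waits for its positive twin, as in cases (a) to (c) above.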


Obviously the rollback mechanism can be
extremely costly in the worst case, but there are a
number of arguments suggesting that in a realistic
system it is not nearly as costly on the average
as one might imagine. First, most systems operate
in a pattern where each event involves one input
message and one output message. Hence, the number
of antimessages directly sent by any one rollback
is approximately the number of events rolled back
over. There is reason to believe that most programs
obey the temporal locality principle, namely
that most messages arrive in the future according
to the local virtual clock at their destination,
thus not causing any rollback at all, and those
that arrive in the virtual past tend strongly to
arrive in the recent past so that few events are
rolled back. Even where a rollback causes several
antimessages to be sent we can expect that most
of them will not cause secondary rollbacks
simply because they each have virtual receive
times greater than or equal to that of the
message that caused the original rollback, and
generally the higher the virtual receive time of
a message, the less likely it is to cause rollback.
The extent to which the temporal locality
principle applies is obviously application-dependent
and can only be verified empirically. One
final argument is important: the "cost" of this
kind of synchronization is only the cost of the
rollback and antimessage overhead; the time it
took to perform the computation being rolled back
is not part of the cost, because the only alternative
would have been to be blocked for the same
length of time anyway.


It is important in virtual time systems that
process creation and input/output be done by
message so that they may be undone by antimessages 
just as any other side effects. The Time Warp 
mechanism is thus able to uncreate, unterminate, 
unerr, uninput and unoutput. We will return to 
these subjects in the next section. 


4.3 The global control mechanism 


The local control part of the Time Warp 
mechanism leaves a number of critical issues
unresolved. How can we be sure that amidst all of
the rollback activity the system makes progress
globally? How can global termination be detected?
How can errors and I/O be handled in the
face of rollback? How can we avoid running out
of memory when the local control mechanism calls
for saving two copies of all messages and several
copies of processes' states? The global control
mechanism resolves all of these issues and
several others.


The central concept of the global control 
mechanism is global virtual time (GVT). Global 
virtual time is a property of an instantaneous 
snapshot of a virtual time system, and is de- 
fined to be the greatest lower bound of the set 
of all virtual times shown by all virtual clocks 
in this or any possible future snapshot. (See 
[Jefferson 83a] for detailed discussion of this 
definition, and for proof that GVT is well-de- 
fined in the case of infinite computations.) 


GVT serves as a floor for the virtual times 
to which any process can ever again roll back. 
It can be proved [Jefferson 83a] that the theo- 
retical definition of GVT for an instantaneous 
snapshot can be characterized operationally as 
the minimum of (a) all virtual times in all local 
virtual clocks in the snapshot, (b) all virtual 
send times in unreceived messages in the input 
queues of the snapshot, and (c) all virtual send 
times in messages that have been sent but not yet 
acknowledged (and may therefore be in transit 
at the moment of the snapshot). This character- 
ization leads to a fast, distributed GVT-estima- 
tion algorithm [Jefferson 83a] that takes O(d) 
time, where d is the delay required for one 
broadcast to all processors in the system. The 
GVT algorithm runs concurrently with the main 
computation and returns a value that is between 
GVT at the time of the start of the algorithm 
and GVT at the time of its completion. It thus 
gives a slightly out-of-date value for GVT, which 
is fundamentally the best one can do without 
stopping the entire system. 
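
The operational characterization of GVT as a minimum over three sets of virtual times can be written down directly. The sketch below computes it for a single, already collected snapshot; it is not the distributed estimation algorithm of [Jefferson 83a], and the function name and argument layout are assumptions for illustration.

```python
def estimate_gvt(local_clocks, unreceived_send_times, unacknowledged_send_times):
    """GVT of one snapshot: the minimum of
    (a) all local virtual clocks,
    (b) virtual send times of unreceived messages in input queues, and
    (c) virtual send times of messages sent but not yet acknowledged.
    Returns +infinity when nothing at all remains.
    """
    candidates = (list(local_clocks)
                  + list(unreceived_send_times)
                  + list(unacknowledged_send_times))
    return min(candidates, default=float("inf"))
```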


It is a simple consequence of the definition 
that GVT never decreases, because any quantity 
defined to be the minimum of all future values 
of another quantity must be nondecreasing. In 
fact, we can prove that if every event terminates 
(and if a few other conditions hold [Jefferson 
83a]) then GVT eventually increases because the 
scheduling rule gives priority to the farthest 
behind processes. This nondecreasing property 
makes it appropriate to consider GVT as the 
virtual time for the system as a whole, and to 
use it as the measure of system progress. It 
measures how much of the system's activity is 
final and complete. Since GVT is a global pro- 
perty of an instantaneous snapshot there can be 
no way for a process to have access to its "cur- 
rent" value, but it has no logical need for that 
information because its own progress is measured 
by the value in its local virtual clock.


During the execution of a virtual time sys- 
tem the Time Warp mechanism must calculate GVT 
every so often. The actual frequency is a trade- 
off: high frequency produces faster response time 
and better space utilization (because of more 
frequent storage reclamation, to be discussed 
later) but also lower processor utilization and
slower progress (measured for example in units of
virtual time per second of real time). Except for 
the space savings this is exactly the same trade- 
off we are familiar with in time-slicing operating 
systems when we adjust the length of the time 
quantum. 


The next few sections describe the uses of 
GVT to control virtual time systems. 


4.3.1. Normal termination detection 


Termination detection in distributed systems 
has been an active field of research for some 
time now, with numerous papers published on the 
subject, e.g. [Dijkstra 80]. With the Time Warp 
mechanism, however, the detection of termination 
of virtual time systems is just one of several 
global problems defined by and solved in terms 
of GVT. Because we have assumed the processes 
do not spontaneously send messages we know that 
when a process runs out of messages it terminates 
normally and its local virtual clock is set to
+∞. When GVT reaches +∞ it means that all local
virtual clocks read +∞ and no messages are in transit,
so no process can ever again "unterminate"
by rolling back to a finite virtual time. Thus,
whenever the periodic GVT calculation returns
+∞, the Time Warp mechanism signals termination.


4.3.2. Memory management and flow control


One of the features of the Time Warp mechanism
is that it is possible to give simple, natural
algorithms for managing memory. In addition to
the memory used by the code and current data of
processes (which the programmer is responsible
for managing) there are four kinds of memory
overhead to be managed.


1. Old states in the state queues. 
2. Messages stored in output queues. 


3. "Past" messages (in input queues)
that have already been processed.


4. "Future" messages (in input queues)
that have not yet been received.


The first three classes of storage, used only to 
support rollback, are all managed similarly. Any 
message, input or output, whose virtual receive 
time is less than GVT can be discarded, as it is 
impossible to roll back to a virtual time when
it might be either re-received or cancelled.
Similarly, for each process all but one saved
state older than GVT can be discarded. This
recycling of outdated memory is called fossil
collection.
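
A minimal sketch of fossil collection for one process follows. The function name, the use of dictionaries with a `"recv_time"` key for messages, and the choice to keep the most recent of the states older than GVT are assumptions for illustration.

```python
def fossil_collect(state_queue, input_queue, output_queue, gvt):
    """Discard rollback-support storage that can no longer be needed.

    state_queue: list of (virtual_time, state), increasing in time.
    input_queue, output_queue: lists of dicts with a "recv_time" key.
    Returns the trimmed (state_queue, input_queue, output_queue).
    """
    old = [s for s in state_queue if s[0] < gvt]
    recent = [s for s in state_queue if s[0] >= gvt]
    # All but one saved state older than GVT can be discarded.
    kept_states = old[-1:] + recent

    def keep(queue):
        # Messages with virtual receive time below GVT are fossils.
        return [m for m in queue if m["recv_time"] >= gvt]

    return kept_states, keep(input_queue), keep(output_queue)
```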


Managing the fourth class of storage, that 
containing unreceived messages, is essentially 
the flow control problem common to all distributed 
systems. But because we assume every message is 
stamped with the virtual coordinates of the send- 
ing and receiving events, and because rollback is 
possible, the flow control problem has more struc- 
ture than it usually does and a new approach is 
warranted. In most environments, where the only 
synchronization tool is process blocking, flow 
control protocols act as a valve to limit the flow
of messages from sender to receiver. The receiver
must be careful never to accept too many messages,
because it must buffer every message it accepts
responsibility for. But under the Time Warp
mechanism, if a receiver's memory is full of input
messages it can always make space by sending back
an unreceived message. The original sender must
then roll back to the state it was in when it sent 
the message, and it will resend the message as it 
executes forward again. The natural message to 
send back is the unreceived message having the 
latest virtual send time (regardless of where it 
came from). 
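
The choice of which message to return can be sketched in a few lines. The helper name `make_space` and the tuple layout `(send_time, sender, text)` are assumptions for illustration; what the sender does on getting its message back (rolling back and resending) is not modelled here.

```python
def make_space(unreceived):
    """Pick the unreceived message to send back when input memory is full.

    unreceived: list of (send_time, sender, text) tuples not yet received.
    Removes and returns the message with the latest virtual send time,
    regardless of which process sent it.
    """
    victim = max(unreceived, key=lambda m: m[0])
    unreceived.remove(victim)
    return victim
```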


Returning a message to the sender is the 
message-analog of process rollback; they are 
obverse and reverse of the same coin. We do not 
have space here to give more detailed arguments 
in favor of this protocol, but it seems to offer 
very efficient and simple flow control for virtual 
time systems. Of course this hypothesis also 
needs empirical verification. 


4.3.3. Error detection 


When a process commits a run-time error its 
state must be marked ERROR. This should not
necessarily cause termination of the entire computation
because the erring process might roll
back and "unerr", sending an antimessage for the
message containing the error indication. An
error is only "permanent" if it is impossible for
the process to roll back to a virtual time ear- 
lier than that of the error. 


A process in an ERROR state keeps executing, 
but all successive states are also ERROR states. 
If and when an ERROR state is fossil-collected 
(because its virtual time is older than GVT) then 
no rollback will ever undo the error. It should 
then be reported to some error policy software 
and/or to the user. 


4.3.4. Input and output 


When a process sends a command to an output 
device or any other agent external to the Time 
Warp mechanism it is important that the physical 
output activity not be committed immediately 
because the sending process may roll back and 
cancel the output. The criterion for deciding 
when a command to an output process can actually 
be performed is that GVT must exceed the virtual 
receive time of the message containing the com- 
mand. After that point no antimessage for the 
command can ever be generated, and the output can 
be physically committed. 


This policy is implemented in the Time Warp 
mechanism by designating certain processes as 
output processes. Such processes have null bodies 
and act primarily as input queues. If a message 
to an output process is not annihilated by an 
antimessage it will remain there until it is 
fossil-collected, at which point it is known that 
no rollback can ever occur to cancel the output. 
Fossil collection then, in addition to recovering 
memory and detecting errors, must also physically 
commit all output action. Similar considerations 
hold for input. 
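
The commit criterion for output can be sketched as follows. The function name `commit_outputs`, the `emit` callback standing in for the physical device, and the tuple layout of queued commands are assumptions for illustration.

```python
def commit_outputs(output_process_queue, gvt, emit):
    """Physically perform output commands that can no longer be cancelled.

    output_process_queue: list of (recv_time, command) pairs held in the
        input queue of an output process.
    gvt: current estimate of global virtual time.
    emit: callable that performs the irreversible output action.
    Returns the commands that must still be held back.
    """
    pending = []
    for recv_time, command in sorted(output_process_queue):
        if recv_time < gvt:
            # GVT exceeds the receive time: no antimessage can arrive.
            emit(command)
        else:
            pending.append((recv_time, command))
    return pending
```

Commands are emitted in virtual receive time order, so the external world sees output in the same order as the virtual time semantics prescribes.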


4.3.5. Snapshots and crash recovery


The problem of taking a consistent, global
snapshot that is useful for continuation in the
event of a crash arises with all distributed systems
[Russell 80]. A snapshot at a single instant
of real time is, of course, theoretically impossible
to implement because we cannot have perfectly
synchronized real time clocks [Lamport 78].
In a virtual time system a snapshot of all processes
and the relevant messages at a particular
virtual time (which must be taken at different
real times for different processes) forms a
natural and meaningful snapshot of the system that
is easily implementable. A full (nonincremental)
snapshot of the system at virtual time t can be
constructed by a procedure in which each process
snapshots itself as it passes virtual time t in
the forward direction, and "unsnapshots" itself
whenever it rolls back over virtual time t.
Whenever GVT exceeds t the snapshot is complete and
valid. A variation on this procedure can provide
an incremental snapshot facility. Because of
space limitations further details must be postponed
to a future paper.


5. Examples of virtual time systems 

A wide variety of distributed systems can be
viewed as virtual time systems. In the next three
subsections we give examples to illustrate the
range of the concept, and in particular, we show 
how both the distributed discrete event simula- 
tion problem and the distributed database con- 
currency control problem are instances of the 
virtual time paradigm. 


5.1. Example 1: Distributed discrete event simulation

The most extensively studied virtual time 
system is distributed discrete event simulation, 
in which every process represents an object in 
the simulation and virtual time is identified 
with simulation time. The fundamental operation 
in discrete event simulation is for one process 
to schedule an event for execution by another at 
a later simulation time. In a virtual time sys- 
tem we emulate this action simply by having the 
first process send an event message to the second 
process with its virtual receive time equal to 
the event's scheduled simulation time. The requirement
that each process must receive messages
in virtual receive time order is equivalent to the
fundamental semantic requirement of simulation
that events be executed in simulation time order.
Simulation is clearly one of the most "general"
applications of the virtual time paradigm because 
the virtual times (simulation times) of events 
are completely under the control of the user, 
and because it makes use of almost all of the 
degrees of freedom allowed in the definition of 
a virtual time system. Any mechanism for general 
distributed discrete event simulation can be used 
as an implementation of virtual time. Chandy and 
Misra in [Chandy 81] give a simulation method 
based on blocking and distributed deadlock detection.
See [Jefferson 82] for a detailed comparison.


5.2. Example 2: Distributed database concurrency control

In a distributed database the fundamental 
synchronization problem is to make distributed 
transactions appear to be atomic with respect 
to other transactions. To accomplish this effect 
in a virtual time system it is only necessary to
do two things. First, the entire database system,
including all transaction software, management
software, and even the data, must be cast as a
collection of processes communicating by message.
In particular, data items must be viewed as
stunted processes that respond primarily to read and
write messages. Second, the system must ensure
that each transaction executes within a band
of virtual time that does not overlap with the
bands allocated to other processes. This can be
done simply by using the real time of a transaction's
initiation as the high order bits of the
virtual time band allocated to it (with the
place of initiation as middle-order bits to break
ties). The apparent indivisibility of execution
of transactions follows directly. A full description
of virtual time as the basis of database
concurrency control can be found in
[Jefferson 83b]. 
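
One way to picture the band assignment is as bit packing on integer virtual times. The sketch below is an illustration only: the function name `transaction_band`, the field widths `site_bits` and `local_bits`, and the use of an integer clock reading are assumptions, not details taken from [Jefferson 83b].

```python
def transaction_band(start_time_ticks, site_id, site_bits=16, local_bits=32):
    """Return the (lowest, highest) virtual times of a transaction's band.

    start_time_ticks: real time of initiation as an integer clock reading
        (high-order bits).
    site_id: integer naming the place of initiation (middle-order bits,
        used to break ties between equal start times).
    The low-order `local_bits` bits are left free for the events that
    occur inside the transaction.
    """
    if not 0 <= site_id < (1 << site_bits):
        raise ValueError("site_id does not fit in site_bits")
    base = ((start_time_ticks << site_bits) | site_id) << local_bits
    return base, base + (1 << local_bits) - 1
```

Two transactions started at the same tick on different sites, or at different ticks anywhere, receive disjoint bands, ordered first by initiation time and then by site.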

This assignment of virtual time bands to 
transactions guarantees not only that they are 
atomic, but that they are apparently executed 
(committed) strictly in virtual time order, i.e., 
in the order of their initiation. This is a 
stronger scheduling constraint than the usual 
serializability criterion. Of course, although 
the transactions appear to be executed sequen- 
tially in virtual time order, they are actually 
executed by the Time Warp mechanism concurrently, 
or in any convenient order. 


Virtual time used for database concurrency 
control is quite different from that used for 
distributed simulation. First, virtual time 
is derived from real time in the database example, 
whereas it is completely independent of real time 
in the simulation case. Second, virtual time 
values in simulation are actually manipulated by 
user software as ordinary data. In databases 
the behavior of the system is time-independent 
and the virtual times are presumably "hidden" 
from most levels of DBMS software. 


In many respects the Time Warp mechanism 
applied to database concurrency control is similar
to multiple-version concurrency control
mechanisms [Papadimitriou 82] in that it maintains
several successive versions of each data
item and it has the ability to achieve serializability
in spite of transaction collisions
because it can satisfy a read request that is
time-stamped earlier than the "current" version
of the data by simply accessing a saved earlier
version. But when a multiple-version mechanism
is faced with a write request that is time-stamped
earlier than the current version of the
data there is no choice but to abort the entire
transaction that the write request is part of
(and possibly some other transactions in progress
as well), and to restart it with a later time
stamp. In situations where there is a high probability
of transaction collision this leads to
a lot of wasted computation and to the possibility
of starvation. The Time Warp mechanism, however,
never aborts a transaction. It may roll back 
parts of several transactions when a collision 
occurs, but the amount of the computation un- 
wound during a collision is limited to that part 
that would be causally affected if requests were 
handled out of time stamp order. 


Other parts of the colliding transactions that are 
not functionally dependent on the data involved 
in the collision are not rolled back. 


Example 3: Virtual circuit communication


One of the main functions of a network com- 
munication protocol is to provide a virtual cir- 
cuit facility, a buffering and synchronization 
mechanism that, for each sender-receiver pair, 
delivers messages to the receiver in the same 
order they were sent. This effect can be accom- 
plished automatically in a virtual time system 
if the virtual receive time of a message is de- 
fined to be the real time of its sending. The 
requirement that messages be processed in virtual 
receive time order is then the same as requir- 
ing them to be processed in sending order. Im- 
plementing virtual circuit communication as a 
virtual time system does not confer any particular 
speed or concurrency benefits over other (simpler) 
implementations, but it does show something of the 
breadth of the virtual time concept. There may, 
however, be important practical benefits in en- 
vironments where the ability to checkpoint a dis- 
tributed system is required. 


6. Extended analogy to virtual memory 


Now that we have presented the outlines of
the theory we can explain the use of the term
"virtual time". I see virtual time as the natural
temporal analog of virtual memory, and have been
consciously guided by the lessons of virtual
memory systems from the beginning of this work.
There is a compelling extended analogy between the
two which may lend some credibility to this
otherwise unorthodox approach to distributed
computation. Because of space limitations we will
present the comparison as a sequence of parallel
concepts in which space and time play almost
symmetric roles.


- A page is analogous to an event. The 
virtual address of a page is its spatial 
coordinate; the virtual time of an event 
is its temporal coordinate. 


- A page in memory at time t is analogous
to an event in the future of process x;
a page out of memory at time t is analogous
to an event in the present or past
of process x.


- Accessing a page in memory is comparatively
inexpensive, but accessing a page out of
memory causes a very expensive page fault.
Analogously, sending a message into the
virtual future of a process is comparatively
inexpensive, while sending a message
into its virtual past causes a very
expensive time fault, i.e., rollback.


- In view of this it is only cost-effective
to execute programs under a virtual memory
system that obey the spatial locality
principle, i.e., that most memory accesses
are to pages already resident in memory
so that page faults are rare. Likewise,
it is only cost-effective to run programs
under a virtual time system that obey the
temporal locality principle, i.e., that
most messages arrive in the virtual future
of the destination processes so that time
faults are rare.


- Memory mapping converts virtual addresses to
real addresses at different times. Time
mapping (also known as scheduling!) maps
virtual times to real times; the same virtual
time may be scheduled at different real times
in different processes.


- The only acceptable memory maps are the one- 
to-one functions because they preserve dis- 
tinctness, i.e., map distinct virtual 
addresses to distinct real addresses. The 
only acceptable time maps are the strictly 
increasing functions because they preserve 
distinctness and order, i.e., map distinct 
virtual times into distinct real times, 
earlier virtual times into earlier real 
times, and later virtual times into later 
real times. 


The analogy between virtual time and virtual 
memory can be greatly extended (working set, 
thrashing, pointer, data structure) and it is 
very thought-provoking to do so. It suggests 
that the concept of virtual time can provide the 
same kind of clean, efficiently implementable 
abstraction of the time resource in a distributed 
environment that virtual memory has provided for 
space resources.


7. Future work 


There is still a tremendous amount of work 
to be done to follow up on this research. For 
example, there is very little empirical or analy- 
tical work published on the performance of systems 
incorporating rollback; this is the most important 
immediate issue. In the longer term, a programming
language for concurrent systems with virtual
time as its semantic basis seems reasonable. It
should also be possible to incorporate virtual 
time into specification languages designed for 
specifying and verifying the synchronization re- 
quirements of concurrent systems. Finally, it 
seems natural to suppose that a generalization 
of both virtual memory and virtual time--virtual 
spacetime--should be "out there", waiting to be 
explicated. The field seems wide open. 
PROCESS MANAGEMENT OVERHEAD IN A SPEEDUP-ORIENTED MIMD SYSTEM


Ruknet Cezzar
Human Resources Administration


Abstract -- The effects of Process Management
Software (PMS) on the throughput of an MIMD machine
are studied. Queueing-theoretic models are used
to predict the effective throughput where only the
running of PMS is taken into account. Performance
problems with Centralized PMS, a method of
decentralization through multiple ready lists, and the
required extent of decentralization are discussed
and analyzed.


As a way of avoiding centralized mechanisms
and the resulting performance bottleneck
conditions, a policy is studied where i) multiple ready
lists are used, ii) processors access the ready
lists randomly and uniformly. The effective
throughput for this policy is obtained by an
approximate solution to a highly complex closed queueing
system. This randomized-access policy is shown to
achieve effective throughputs almost linear in n,
for an n-processor MIMD system. The policy is also
shown to achieve speedup factors very close to n,
regardless of how large n is.


1. Introduction


An MIMD machine is a tightly-coupled system 

in which a large number of processors share primary 
memory, and perform parallel computations in a 
cooperatively-tasked manner. If contention for 
shared computing resources and the processing over- 
head associated with the management of concurrency 
are minimized, high computational speeds can be 
achieved. 


This paper deals with the performance of
Process Management Software (PMS) for achieving
high throughput on the assumption that performance
issues at the architectural level have been
resolved. PMS overhead and its minimization are
discussed and analyzed. Assuming that the only
performance overhead is the running of PMS,
parallelization of PMS to achieve nearly constant
processor utilization (i.e., speedup factors very
close to n for an n-processor system) is also
studied.


The architectural features of an MIMD machine
which are relevant to the analyses in this paper are:


- Identical processors;


- Uniform access to common memory, where no
processor is favored over the others;


- Uniform sharing of peripherals and secondary
storage, in a way similar to the sharing of
primary memory.


2. The Process Management Software


The PMS of an MIMD machine has the following
functions:


0190-3918/83/0000/0395$01.00 © 1983 IEEE 
- Creating processes,


- Putting processes to sleep,


- Waking processes up,


- Assigning processes to processors,


- Destroying processes.


The PMS will consist of executive code and 
the system tables referenced by that code. The 
system tables are: i) Ready List(s) which con- 
tain the state vectors of processes ready for 
processor assignment; ii) Blocked Lists which 
contain the state vectors of processes blocked 
for shared resources. 


In a centralized PMS there is one global 
ready list, and its use by the processors requires 
mutual exclusion. 


In such a PMS, there is a degradation of
effective throughput as the number of processors
becomes large. In what follows we will study the
extent of this degradation as well as the
performance of PMS when it is decentralized through
the use of multiple ready lists.


3. The Throughput Models 


In queueing-analytic models, the processors 
are treated as active elements sharing the ready 
lists as resources. The throughput is measured in 
number of user-level instructions per unit time. 
This is the "effective" throughput which indicates 
the work done by the processors while not engaged 
in ready list access. System-software overhead, 
other than those associated with ready list 
accesses, is not taken into account. The effec- 
tive throughput results may be interpreted accord- 
ingly. 


The queueing models yield the average (expect- 
ed) number of processors in normal execution (user- 
state). Multiplying this by the rated speed, same 
for all processors, gives the effective throughput. 
The relationship of this to other common perform- 
ance measures for multiple-processor systems is 
given below. Let: 


$s$ = Rated speed of each processor.
$K_n$ = Avg. number of processors in user-state
in an n-processor MIMD system.
$F_n$ = Finishing time of a computation, when
run on an n-processor MIMD system.
$F_1$ = Finishing time of the same computation,
when run on a 1-processor system.
$\alpha = 1 - K_1$, Percent of time a uniprocessor would
spend in ready list access, with no
contention from other processors.


In terms of the above, the performance measures for
an n-processor system are:


$T_n = sK_n$    Effective Throughput. (1)

$S_n = \frac{T_n}{T_1} = \frac{K_n}{1-\alpha}$ or $S_n = \frac{F_1}{F_n}$    Speedup Factor. (2)

$R_n = \frac{S_n}{n} = \frac{K_n}{(1-\alpha)n}$    Relative Utilization Rate. (3)

E, = as Relative Efficiency. (4) 
Rn 

4. Centralized PMS Model 


Suppose there is a single copy of the ready 
list, residing in common memory, and accessible to 
all processors. For this queueing model and for
subsequent models, the following general
assumptions are made:


- No processor is favored over the others in 
accessing the ready list(s); 


- Processors are independent, and interact only 
with respect to ready list accesses. 


In addition to n, and m = 1 (single ready
list), we define the following input parameters:


$1/\lambda$: Actual (mean) time between ready list
accesses, per processor.
$1/\mu$: Actual (mean) time inside the ready
list, per ready list.


The parameters which are derived from the above
and referred to frequently are:

$\rho = \lambda/\mu$: Ready list access traffic intensity
per processor.

$\alpha = \lambda/(\lambda+\mu)$: The PMS overhead for a 1-processor
system as defined earlier.

These parameters are related as $\rho = \alpha/(1-\alpha)$ and
$\alpha = \rho/(1+\rho)$.
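
As a quick check of these definitions, the following minimal sketch computes the traffic intensity ρ = λ/μ and the uniprocessor overhead α = λ/(λ+μ) from the two input rates. The function names are illustrative, not taken from the paper.

```python
def traffic_intensity(lam: float, mu: float) -> float:
    """rho = lambda / mu: ready list access traffic intensity per processor."""
    return lam / mu


def uniprocessor_overhead(lam: float, mu: float) -> float:
    """alpha = lambda / (lambda + mu): PMS overhead of a 1-processor system."""
    return lam / (lam + mu)


# Example used later in the paper: 20 time units of normal execution
# per 1 time unit spent inside the ready list.
lam, mu = 0.05, 1.0
rho = traffic_intensity(lam, mu)
alpha = uniprocessor_overhead(lam, mu)
```

With these rates ρ = .05 and α rounds to .0476, the values used in the numerical examples.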
As the target parameters, we define the follow- 
ing random variables, and their expectations: 

K: Number of processors in normal execution, 
running user-level processes. 

N: Number of processors engaged in ready list 
access, attempting to gain its lock or 
inside it. 

Q: Number of processors attempting to gain 


access to the ready list, but not inside 
it. These processors are idled. 
The results are obtained under two sets of
assumptions:

1. Deterministic Case: The rate of ready list
accesses and time spent inside the ready
list are both constant. The model is solved
for K = Average number of processors in
normal execution.


2. Probabilistic Case: The rate of ready list
accesses and time spent inside the ready list
vary according to Poisson-exponential law.
The aim is to find E(K) = Expected number of
processors in normal execution.


4.1 Deterministic Case 


The parameters $\lambda$ and $\mu$ stand for the actual,
rather than the mean, arrival and service rates.
The purpose of deterministic assumptions is to un- 
cover the best possible throughput with respect to 
ready list access and service patterns. Otherwise, 
these assumptions, especially with respect to the 
arrival rate are not very realistic. 


This case can be analyzed as a closed queueing
system where K stands for the (average) number of
processors in normal execution. The transient
behavior of the system depends on how the processors
initially access the ready list, and there are an
infinite number of possible solutions for K. On
the other hand, because of the sequential server,
even if all processors access the ready list at
the same time initially, they would line up after a
number of cycles. In other words, after each
processor completes one cycle of normal execution
and ready list access, the accesses to the ready
list must be at least $1/\mu$ time units apart.
Depending on the value of $\rho$, this self-regulation places
a limit on the maximum queue length. Keeping this
self-regulation in mind, the steady state solution
for K can be obtained by analyzing the following
work-balance equation:


$$K\lambda = (1-P_0)\mu$$    (5)


where $P_0$ = Fraction of time no
processors are engaged
in ready list access.


Through a close analysis of a typical cycle of
length $\lambda^{-1} + \mu^{-1}$, first $P_0$ is determined for the two
distinct ranges $\rho \le (n-1)^{-1}$ and $\rho > (n-1)^{-1}$. Then,
the solution, as discussed in (1):


$$K = \begin{cases} (\rho+1)^{-1}n = (1-\alpha)n & \text{for } 0 < \rho \le (n-1)^{-1}, \\ \rho^{-1} = (\alpha^{-1}-1) & \text{for } (n-1)^{-1} < \rho < \infty. \end{cases}$$    (6)


According to this, Q = 0 in the first range,
and $Q = (\rho+1)^{-1}n - \rho^{-1}$ in the second range. This
means that the idled processors can grow at most
linearly with n. As shown later, the growth is
more rapid for stochastic systems.


4.2 The Probabilistic Case 


The birth-death assumptions for this case were 
intended to simplify the analysis, and to find a 


conservative (i.e., nearly worst-case) throughput 
behavior with respect to ready list access and 
service patterns. Since one can conceive hyper- 
exponential interarrival or service times, these 
assumptions do not necessarily correspond to the 
worst-case. However, since the closed queueing 
system is self-regulating, how far the worst-case 
can be from the birth-death case is an interesting 
and open question. 


For this model, one way to find E(K) is to 
look up the solution for E(N), and then evaluate 
E(K) =n - E(N). The solution for E(N) corresponds 
to the Finite Population - Single Server Queue dis- 
cussed in (10) and elsewhere: 


$$E(N) = \sum_{i=0}^{n} i\,P_0\,\rho^{i}\,\frac{n!}{(n-i)!}$$    (7)

where $P_0 = \left[\sum_{i=0}^{n} \rho^{i}\,\frac{n!}{(n-i)!}\right]^{-1}$ = Probability that
no processors are engaged in ready
list access.
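
The finite-population, single-server result can be evaluated directly. The sketch below assumes the standard birth-death form, in which the weight of the state with i processors engaged in ready list access is ρ^i n!/(n-i)!; the function name is illustrative.

```python
from math import factorial


def single_list_birth_death(n: int, rho: float) -> tuple[float, float, float]:
    """Return (P0, E(N), E(K)) for n processors sharing one ready list.

    The state weight for i processors engaged in ready list access is
    rho**i * n! / (n - i)!, as in the finite-population single-server queue.
    """
    weights = [rho**i * (factorial(n) // factorial(n - i)) for i in range(n + 1)]
    p0 = 1.0 / sum(weights)
    e_n = sum(i * p0 * w for i, w in enumerate(weights))
    return p0, e_n, n - e_n


p0, e_n, e_k = single_list_birth_death(14, 0.05)
```

For n = 14 and ρ = .05 the expected number in normal execution also satisfies the work-balance relation E(K) = (1 - P0)/ρ, which gives an independent check on the sum.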


From (7) above, the effective throughput of a
1-processor system is $(1-\alpha)s$. To find the
effective throughput for n>1, first $P_0$ and then n-E(N)
need to be evaluated. This could be time consuming
for n>10, which is likely to be the case for the
MIMD system under study. As discussed in (1), if
we solve directly for E(K), using the complementary
birth-death formulation, simpler and more revealing
results are obtained. According to this:


The expression for E(K) above is free of $P_0$,
and can further be simplified as:


$$E(K) = \rho^{-1}\,\frac{F(n-1)}{F(n)}$$    The expected number of
processors in normal
execution.    (8')


where $F(\cdot)$ is the cumulative d.f. for Poisson
with mean $1/\rho$.


The expected number of idled processors is:


$$E(Q) = n - \alpha^{-1}\,\frac{F(n-1)}{F(n)}$$    (9)


The result in (9) is the same as E(idle) found
in (7), except that it is in more convenient form
for evaluation. Consider the numerical example
discussed in (7) where $\rho = .05$ and $\alpha = .0476$. For a
14-processor system, we can look up F(13) and F(14)
from Poisson tables and obtain:


$$E(Q) = 14 - \frac{1}{.0476}\cdot\frac{.0661}{.1049} = .76$$

This agrees with the results in (7) where the
authors mention that "14-processor system has the
effective power of 13.25 processors."
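
The table lookup in this example can be reproduced numerically. The sketch below assumes the Poisson-ratio form of the result, E(K) = ρ^{-1} F(n-1)/F(n), together with E(Q) = n - α^{-1} F(n-1)/F(n), where F is the Poisson cumulative distribution function with mean 1/ρ; the function names are illustrative.

```python
from math import exp, factorial


def poisson_cdf(k: int, mean: float) -> float:
    """Cumulative distribution function of a Poisson variable at k."""
    return exp(-mean) * sum(mean**j / factorial(j) for j in range(k + 1))


def executing_and_idled(n: int, rho: float) -> tuple[float, float]:
    """Return (E(K), E(Q)) for one ready list under birth-death assumptions."""
    alpha = rho / (1.0 + rho)
    ratio = poisson_cdf(n - 1, 1.0 / rho) / poisson_cdf(n, 1.0 / rho)
    return ratio / rho, n - ratio / alpha


e_k, e_q = executing_and_idled(14, 0.05)
```

Dividing E(K) by 1 - α gives the speedup relative to a single processor, which is the "effective power" figure quoted from the earlier study.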


Figure-2 shows the results for $\alpha = .0476$ and
$\alpha = .1$. The shaded areas provide a plausible
performance range with Centralized PMS. The
solutions to all hypoexponential interarrival/service
times would lie in the shaded regions. For
example, a highly realistic case of Poisson
arrival rate and constant service times would lie
in between deterministic and birth-death curves.


4.3 Conclusions for the Centralized PMS


As implied by the results of the two specific
cases analyzed (deterministic, birth-death), there
is an upper limit on the effective throughput.
Since the average number of processors in normal
execution can at most be $\rho^{-1}$, this upper limit is
$s\rho^{-1}$ and does not depend on n. This result can
be generalized to the case where no assumptions
are made about ready list access and service rates,
since the following work-balance equation holds:


$$K_n\lambda = (1-P_0)\mu$$    (10)


where $P_0$ is the fraction of time the server is
idle, or the probability that all processors are
in normal execution. $K_n$ denotes K or E(K) for an
n-processor system. From (10) above, it follows
that:


$$K_n = (1-P_0)\rho^{-1}$$    (11)


Since $1-P_0$ is at most 1, $K_n$ is at most $\rho^{-1}$ and
the upper limit on the effective throughput is
$s\rho^{-1}$, where s is the rated speed of each processor.
For example, if $\rho = .05$, on the average, there
can at most be 20 processors in normal execution,
even if the MIMD system employs 256 processors.
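
A small numeric illustration of this ceiling follows. It assumes only the work-balance bound that K is at most 1/ρ, plus the fact that no processor can exceed the uniprocessor fraction 1 - α of time in normal execution; the function names are illustrative.

```python
def centralized_bound(rho: float) -> float:
    """Upper limit on the average number of processors in normal execution
    when all processors share one ready list: 1 / rho."""
    return 1.0 / rho


def relative_utilization_bound(n: int, rho: float) -> float:
    """Largest possible relative utilization K_n / ((1 - alpha) n)."""
    alpha = rho / (1.0 + rho)
    # K_n cannot exceed n (1 - alpha), nor the work-balance limit 1 / rho.
    k_max = min(n * (1.0 - alpha), centralized_bound(rho))
    return k_max / ((1.0 - alpha) * n)


for n in (10, 21, 64, 256):
    print(n, round(relative_utilization_bound(n, 0.05), 3))
```

Once n passes 1/α the bound on relative utilization falls off as 1/(αn), which is the behavior formalized in the rule that follows.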


Notice that this result holds true even if
interarrival or service times are probabilistically
dependent (i.e., ready list requests are serially
correlated, etc.). For the work-balance equation
in (10) to hold, the only requirement is that the
services are independent of the arrival process.
For this MIMD system, there is no reason to the
contrary. This result has further been generalized
in (1), where the mean traffic intensities ($\rho_i$ for
processor i) are not necessarily equal. In that
case, the upper limit on the effective throughput
is $s\rho_{min}^{-1}$, where $\rho_{min}$ is the minimum mean traffic
intensity.


This result can more formally be stated as 
follows: 


Rule #1: 


For the type of n-processor MIMD system with 
Centralized PMS, there is an upper limit on the 
effective throughput. Since this upper limit does 
not depend on n, as n becomes large, the marginal 
contribution of additional processors to the 


effective throughput diminishes. If and when this 
upper limit is reached, the marginal contribution 
of additional processors to the effective through- 
put is nil. 


Proof: 


As was already shown, the effective throughput
is $T_n = s\rho^{-1}$ at most. As a restatement of the
above, we must show that the relative utilization
rate (i.e., relative to a 1-processor system)
decreases as n increases, and tends to zero as n
grows very large. From (3), by definition, the
relative utilization rate is:


$$R_n = \frac{K_n}{(1-\alpha)n} = \frac{1-P_0}{\alpha n}$$    using the value for
$K_n$ in (11).


Since $1-P_0$ can at most be 1, it is obvious
that, for $n > \alpha^{-1}$, $R_n$ decreases as n increases.
Furthermore:


$$\lim_{n\to\infty} R_n = \lim_{n\to\infty}\left[(1-P_0)(\alpha n)^{-1}\right] = 0.$$


Similar arguments can be carried out for the
case where the $\rho_i$'s are not the same for all processors.
In that case, $\alpha$ will be replaced by $\alpha_{min}$, where
$\alpha_{min} = \rho_{min}(1+\rho_{min})^{-1}$.


5. Decentralized PMS Model 

The results of the previous section point to
the need for decentralization of PMS. Consider the
deterministic case with a single ready list. This
gives the best-case throughput for any given value
of $\rho$, and the upper limit is reached when $n = \alpha^{-1}$.
For example, where $\alpha = .0476$, the upper limit is
reached when there are 21 processors in the MIMD
configuration. If the MIMD system employs more
than 21 processors, and if we cannot adjust the
values of s and $\rho$ by some other means, there is a
need for decentralization of PMS.


The MIMD machine under study, being
speedup-oriented, may be employing the fastest processors
available. In that case, we cannot further
increase the rated speed s, or run the PMS on a
faster processor. The ready list access traffic $\rho$
would depend on the parallel algorithm and the
degree of interprocess communication, where a new
process gets the processor when another process
is put to sleep. This parameter cannot always be
adjusted to the desired value. Therefore, the PMS
itself should be parallelized to run concurrently
on multiple processors.


As the first step, we can use the decentrali- 
zation scheme discussed earlier, where multiple 
ready lists are used. Then, there is the problem 
of how the processors should be accessing these 
ready lists, without any need for some other 
centralized mechanism. 


The ready lists, each accessible to any one of
the processors, may reside anywhere in common memory;
preferably, in different memory banks to minimize
memory conflict. To find the effective throughput
when m (>1) ready lists are used, we can analyze
the multiserver extension of the previous model.
This is shown in Figure-1.


We again look at the deterministic and
birth-death cases, and assume that:


1. The time spent inside a ready list does
not depend on the number of ready lists used;

2. A processor always accesses a ready list
which is currently available (unlocked).
If none is available, it waits in idle
state until one becomes available.


The first assumption above implies that the
size of a ready list, in terms of the average number
of activation records stored in it, does not have
any effect on the time to enter or remove activation
records. If the ready list is implemented as a First
In-First Out linked list of activation records,
this assumption is highly realistic.


The second assumption implies that the
processors know, without additional processing
overhead, which ready lists are currently available.
This additional overhead may be significant for
the MIMD system considered, and is discussed in
the next section. The purpose of this assumption
is to find the minimum number of ready lists which,
subject to the minimization or elimination of the
additional overhead mentioned, will make it
possible to achieve the desired throughputs.


5.1 The Multiserver Model Results

With mathematical arguments detailed in (1),
the solution to the multiserver extension of the
deterministic case is given as:


$$K = \begin{cases} (\rho+1)^{-1}n = (1-\alpha)n & \text{for } 0 < \rho \le m(n-m)^{-1}, \\ \rho^{-1}m = (\alpha^{-1}-1)m & \text{for } m(n-m)^{-1} < \rho < \infty. \end{cases}$$    (12)

As an interesting example, we can choose
$m = \alpha n$ ready lists. For $\alpha = .1$, Figure-3(a)
compares this to the centralized case (m=1) and to
the fully-decentralized case (m=n). The case where
m>n is the same as the case m=n.
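
The two-range deterministic solution can be written as a short function. This is a sketch under the reading that K = n/(1+ρ) while ρ is at most m/(n-m) and K = m/ρ beyond that; the function name is illustrative.

```python
def deterministic_k(n: int, m: int, rho: float) -> float:
    """Average number of processors in normal execution for n processors
    and m ready lists under deterministic access and service times."""
    m = min(m, n)  # more lists than processors behaves like m = n
    if m == n or rho <= m / (n - m):
        return n / (1.0 + rho)   # no processor ever waits
    return m / rho               # all m ready lists are saturated


rho = 1.0 / 9.0                  # corresponds to alpha = .1
for n in (10, 50, 100):
    m = max(1, round(0.1 * n))   # m = alpha * n ready lists
    print(n, m, round(deterministic_k(n, m, rho), 2))
```

With m = αn the two branches meet exactly, so K stays at (1-α)n and grows linearly with n, while any fixed m caps K at m/ρ.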


As intuitively expected, for constant ready
list access and service rates, providing $m = \alpha n$
ready lists can achieve effective throughputs which
grow linearly with n. As will be shown shortly,
this is not necessarily the case for stochastic
ready list access or service rates.


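The solution to the multiserver extension of
the probabilistic case, despite simplifying
birth-death assumptions, is highly complex in the range
1 < m < n. For this range, a highly concise
solution for E(N) can be found in (11):


$$\Pr(N=i) = \begin{cases} P_0\binom{n}{i}\rho^{i} & \text{for } i = 0 \text{ to } m-1, \\ P_0\,\frac{n!}{(n-i)!\,m!\,m^{i-m}}\,\rho^{i} & \text{for } i = m \text{ to } n, \end{cases}$$    (17)

where $P_0 = \left[\sum_{i=0}^{m-1}\binom{n}{i}\rho^{i} + \sum_{i=m}^{n}\frac{n!}{(n-i)!\,m!\,m^{i-m}}\,\rho^{i}\right]^{-1}$

$$E(Q) = \sum_{i=m}^{n}(i-m)\Pr(N=i)$$    Expected number of
idled processors.    (18)

$$E(N) = \sum_{i=0}^{m-1} i\Pr(N=i) + E(Q) + m\left[1 - \sum_{i=0}^{m-1}\Pr(N=i)\right]$$    (19)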
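
For moderate n these expressions can be evaluated directly. The sketch below assumes the standard finite-source, m-server birth-death state probabilities (binomial weights below m engaged processors, and n!/((n-i)! m! m^(i-m)) weights from m upward); the function names are illustrative.

```python
from math import comb, factorial


def state_probabilities(n: int, m: int, rho: float) -> list[float]:
    """Pr(N = i), i = 0..n, for n processors sharing m ready lists."""
    weights = []
    for i in range(n + 1):
        if i < m:
            w = comb(n, i) * rho**i
        else:
            w = factorial(n) / (factorial(n - i) * factorial(m) * m ** (i - m)) * rho**i
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


def multiserver_expectations(n: int, m: int, rho: float) -> tuple[float, float, float]:
    """Return (E(Q), E(N), E(K)): idled, engaged, and executing processors."""
    pr = state_probabilities(n, m, rho)
    below = range(min(m, n + 1))                 # states with a free ready list
    e_q = sum((i - m) * pr[i] for i in range(m, n + 1))
    e_n = sum(i * pr[i] for i in below) + e_q + m * (1.0 - sum(pr[i] for i in below))
    return e_q, e_n, n - e_n
```

Setting m = 1 recovers the single ready list case, and m = n gives E(Q) = 0 with E(K) = n/(1+ρ).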


From E(N), we can obtain E(K) = n - E(N) and
find the effective throughput. However, the
expressions are quite complex and require evaluation
of $P_0$. By solving the complementary birth-death
formulation directly for E(K), we can obtain a
simpler and more revealing expression for E(K).
Accordingly:


$$E(K) = m\rho^{-1}\,\frac{P_0(n)}{P_0(n-1)} \quad \text{for } 1 \le m < n,$$

where

$$P_0(n) = \left[\sum_{k=0}^{n-m}\frac{(m\rho^{-1})^{k}}{k!} + \sum_{k=n-m+1}^{n}\frac{m!\,m^{n-m}\,\rho^{-k}}{k!\,(n-k)!}\right]^{-1}$$


In effect, the solution to the most difficult
case of 1 < m < n is found in terms of:


$P_0(n)$: Probability that no processor is in
normal execution in an n-by-m system;


$P_0(n-1)$: Probability that no processor is in
normal execution in an (n-1)-by-m
system.


Note that E(K) still needs to be evaluated.


The evaluation is simpler, since the same procedure
can be used for $P_0(n)$ and $P_0(n-1)$. For the
fully-decentralized case where m ≥ n, the solution is
simpler:


$$E(K) = (\rho+1)^{-1}n = (1-\alpha)n$$    (13)


Clearly, for the case of m ≥ n, E(Q) = 0,
since a customer will always find an available
server.


The interesting case of $m = \alpha n$ ready lists is
shown in Figure-4(b), where m = 1 (centralized) and
m = n (fully-decentralized) cases are also shown
for comparison. The portion of the $m = \alpha n$ curve up to
10 processors is shown as a visual aid. It
corresponds to the anomaly, where presumably less than
one ready list is used.


From Figure-3(b), we see that, under birth-death
assumptions, employing $m = \alpha n$ ready lists
does not yield effective throughputs which grow
linearly with n. This counter-intuitive result
will be formalized later.
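
The complementary formulation can be checked numerically for small systems. The sketch below is an assumption-laden reconstruction: it takes P_0(n) to be the probability that no processor is in normal execution, computed from the product-form weights of the closed two-station model, and uses E(K) = (m/ρ) P_0(n)/P_0(n-1) for 1 ≤ m < n. The function names are illustrative.

```python
from math import factorial


def p0_none_executing(n: int, m: int, rho: float) -> float:
    """P_0(n): probability that none of n processors is in normal execution
    when they share m ready lists (requires 1 <= m <= n)."""
    # k processors executing, with at least m of the others engaged.
    first = sum((m / rho) ** k / factorial(k) for k in range(n - m + 1))
    # k processors executing, with fewer than m of the others engaged.
    second = sum(
        factorial(m) * m ** (n - m) * rho ** (-k) / (factorial(k) * factorial(n - k))
        for k in range(n - m + 1, n + 1)
    )
    return 1.0 / (first + second)


def expected_executing(n: int, m: int, rho: float) -> float:
    """E(K) for 1 <= m < n from the ratio of P_0(n) to P_0(n-1)."""
    if not 1 <= m < n:
        raise ValueError("formula assumes 1 <= m < n")
    return (m / rho) * p0_none_executing(n, m, rho) / p0_none_executing(n - 1, m, rho)


def relative_utilization(n: int, m: int, rho: float) -> float:
    """R_n = E(K) / ((1 - alpha) n), with alpha = rho / (1 + rho)."""
    return expected_executing(n, m, rho) * (1.0 + rho) / n
```

For n = 20 and ρ = 1/9 (α = .1), using m = 2 ready lists gives a relative utilization below 1, since some processors are idled waiting for a lock.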


5.2 Conclusions for the Decentralized PMS 


About the required number of ready lists, we
can assert and prove the following statements.


Rule #2:


If a fixed number m (<n) ready lists are
used, the effective throughput is still bounded
above by $ms\rho^{-1}$. Therefore, as n exceeds $m\alpha^{-1}$ and
gets larger, the relative utilization rates
diminish. As n grows very large, the contribution
of additional processors to the effective
throughput becomes nil.

Proof:

Since K ≥ E(K), we only need to look at the
deterministic case. As n becomes sufficiently
large and exceeds $m\alpha^{-1}$, the second part of the
solution in (12) applies, where $K = m\rho^{-1}$ at most.
The effective throughput can therefore be at most
$ms\rho^{-1}$. In this case, the relative utilization is:

$$R_n = \frac{m\rho^{-1}}{(1-\alpha)n} = \frac{m}{\alpha n} \quad \text{and} \quad \lim_{n\to\infty} R_n = 0.$$
Rule #3:


Under deterministic assumptions, if we employ
m = cn ready lists for some constant $c \ge n^{-1}$,
constant utilization rate can be achieved.


Proof:


For n sufficiently large, the second part of the
solution for the deterministic case (12) applies. If
we replace m by cn in (12), we get:

$$K = \rho^{-1}(cn) \quad \text{and} \quad R_n = \frac{\rho^{-1}cn}{(1-\alpha)n} = \frac{c}{\alpha}: \text{ a constant.}$$


As a corollary to this rule, since by definition
the speedup factor is $S_n = nR_n$, the ideal
speedup factor of n can be achieved if $c = \alpha$.

Rule #4:


Under stochastic assumptions, where there is
a finite probability of having all processors
engaged in ready list access (i.e., $P_n > 0$), we must
employ at least m = n ready lists in order to
achieve constant utilization rates. This means
that, if we employ m = cn ready lists for some
constant c, where c is in the range $n^{-1} \le c < 1$,
constant utilization rates are not achieved.


Proof: 


First note that for most common (and
non-contrived) stochastic systems $P_n > 0$, as is the
case with the birth-death assumptions. For the
above claim, we provide a proof by contradiction.


Suppose we do achieve constant utilization
rate, where $R_n = 1 - c$, for some constant c in the
range $n^{-1} \le c < 1$. By definition of relative
utilization:


$$R_n = \frac{E(K)}{(1-\alpha)n} = 1 - c.$$    (14)


Also, for any stochastic system, the following
work-balance equation must hold:


$$\mu[n - E(K) - E(Q)] = \lambda E(K).$$    (15)


Solving (14) for E(K), and then solving (15)
for E(Q), we obtain:

$$E(Q) = cn$$, under the original assumption
of $R_n = 1 - c$.    (16)


Suppose $P_i$ is the probability that i processors
are engaged in ready list access. Then, by
definition:


$$E(Q) = \sum_{i=m+1}^{n}(i-m)P_i.$$


If we choose the best possible value for m,
namely m = n-1, from above, we obtain $E(Q) = P_n$.
On the other hand, we obtained E(Q) = cn earlier.
What this means is that:


$$P_n = cn$$ if we assume that $R_n = 1 - c$
(i.e., constant utilization).    (17)


Since $P_n$ is the finite probability mentioned
earlier, the equation in (17) would make sense only
for the following trivial cases:

i) $P_n = 0$. This violates the condition where
$P_n > 0$, and essentially corresponds to the
deterministic case. For the deterministic
systems, the serialization of ready list
accesses does not permit having all n
processors engaged in ready list access.


ii) n = 1. In this case, m = n-1 = 0,
meaning no ready lists.

iii) $c < n^{-1}$. This violates the condition
where $c \ge n^{-1}$. This case corresponds to
the anomaly of using less than 1 ready
list. Refer to the $m = \alpha n$ curve for
n < 10, in Figure-4(b).


As a corollary to this rule, if constant
utilization cannot be achieved when m < n and
$P_n > 0$, the ideal speedup factors also cannot be
achieved.


The foregoing rules indicate that we should
provide at least m = n ready lists to make constant
utilization (or possibly the ideal speedup factor)
achievable.
6. Proposed Ready List Access Policy


Even if we provide m = n ready lists, the
overhead associated with processors accessing the
available ready lists should be taken into account.
If we provide a global index which indicates which
ready lists are currently available, this global
index will also be serially reusable. In much the
same way as the single ready list, this global
index will place an upper limit on the effective
throughput. Suppose the time to check this global
index is $1/\mu'$. In this case, by analysis of the
deterministic case not discussed here, the upper
limit for K turns out to be


$$K = \frac{\mu'}{\lambda} \text{ at most, and for } n \ge 1 + \frac{\mu'}{\lambda} + \frac{\mu'}{\mu}.$$


For the earlier example where $\alpha = .0476$, i.e.,
$\lambda = .05$ and $\mu = 1$. If the time to check the
global index is only half of the time inside a
ready list, then $\mu' = 2$, and:


$$K = \frac{2.0}{.05} = 40 \text{ at most, for } n \ge 43.$$

Obviously, if the time to check this global
index increases with n, the result for K is worse.
It is clear that providing a centralized global
index is unacceptable for this type of MIMD system.
Other possible approaches and their shortcomings
have been discussed in (1).
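
The ceiling imposed by a global index can be illustrated with a short calculation. This sketch rests on an assumption that is not spelled out in the text: every ready list access first spends 1/μ' checking the index, so the index completes at most μ' checks per unit time and work balance gives Kλ ≤ μ'. The saturation point follows from one deterministic cycle of length 1/λ + 1/μ' + 1/μ. The function name and parameter names are illustrative.

```python
def global_index_limit(lam: float, mu: float, mu_index: float) -> tuple[float, float]:
    """Return (k_max, n_sat) for a serially reusable global index.

    k_max: largest average number of processors in normal execution.
    n_sat: number of processors at which the index becomes saturated,
           assuming constant execution, index-check, and ready list times.
    """
    k_max = mu_index / lam
    n_sat = mu_index * (1.0 / lam + 1.0 / mu_index + 1.0 / mu)
    return k_max, n_sat


# Index check takes half as long as a ready list access.
k_max, n_sat = global_index_limit(lam=0.05, mu=1.0, mu_index=2.0)
```

With λ = .05, μ = 1 and μ' = 2 the sketch gives a limit of 40 processors in normal execution, reached at 43 processors.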


As a way of avoiding centralized mechanisms, 
we propose the following ready list access policy: 


- Processors choose to access the ready lists 
randomly and uniformly (with equal proba- 
bilities). If the chosen ready list is 
currently locked, the processor waits in idle 
state until the lock is removed. 


6.1 The Random-Routing Model 

The analytical model for predicting the effec- 
tive throughput for the proposed ready list access 
policy is shown in Figure-5. 


These types of models with uniform and random 
routing of arrivals to servers received consider- 
able attention in literature; primarily with re- 
spect to memory interference. Works done in (8,9) 


correspond to the deterministic case where ,~L=0 


and p7t = 0. A review and comparison of various 
approximate solutions to these types of models can 
be found in (12). Exact solutions to these
types of models are difficult or computationally
intractable, even when the underlying stochastic
process is Markovian. The analysis is more
tractable for an infinite number of customers (processors)
or servers (ready lists). However, we are
primarily interested in specific (and finite) values
of n and m.


First note that, for any given set of parameters 
n, m, and a, the worst case corresponds to 
the Centralized PMS Model results with m = 1. 
Similarly, the best-case E(K) corresponds to the 
Decentralized Model results. 


For the deterministic case, K is obtained 
through a simulation program described in (1). The 
results of this simulation for a = .0476 and a = .1 
are given in Table-1. Notice that the program was 
validated by running the case of m=1, for which the 
solution is known. 


              NUMBER OF   EXPECTED NUMB.   EXPECTED NUMB.
NUMBER OF     READY       OF PROCESSORS    OF PROCESSORS
PROCESSORS    LISTS       ENGAGED          IN NORM-EXEC.
   (N)         (M)        E(N)             E(K)

    5           5         0.24              4.76
   10          10         0.48              9.52
   15          15         0.73             14.27
   20          20         0.98             19.02
   25          25         [illegible]      [illegible]
   30          30         1.47             28.53
   35          35         [illegible]      [illegible]
   40          40         1.97             38.03

Run #1: λ = 1/20, 20 time units in normal execution. 
        1 time unit inside a ready list. 

    5           5         0.50              4.50
   10          10         1.00              9.00
   15          15         1.54             13.46
   20          20         [illegible]      [illegible]
   25          25         2.61             22.39
   30          30         3.12             26.88
   35          35         3.64             31.36
   40          40         [illegible]      35.03

Run #2: λ = 1/9, 9 time units in normal execution. 
        1 time unit inside a ready list. 


* Program's output when m = 1 is used as input, to 
compare with the analytical results of the deterministic 
single ready list model: 

   a = .10     n       N         K
               5     0.5005    4.4995
              10     1.0047    8.9953
              15     5.9957    9.0043


Table-1: Throughput Results for the Proposed Ready 
         List Access Policy Under Deterministic 
         Assumptions. 


Under birth-death assumptions, the exact solution 
exists. As outlined in (10), the solution 
method involves analysis of a queueing network with 
m+1 nodes. This solution requires enormous computational 
effort even for moderate values of n and m. 


Instead, we can demonstrate the effectiveness 
of the random-access policy by obtaining an approxi- 
mate solution for E(K), and making sure that it is 
not overestimated. This approach is based on the 
analyses of local queues found in front of the 
ready lists. 


Since the overall Poisson arrival rate is 
randomly split into m, the arrival rate to a local 
queue is also Poisson, with mean λ/m per processor. 
The self-regulation is more complex. To simplify 
matters, we assume that: 

- A processor accesses a particular ready list 
at rate λ/m, regardless of whether it is queued 
for some other ready list. 


This means that the processors engaged in 
ready list access rejoin local queues with probability 
m⁻¹ and at mean intervals λ⁻¹. This does 
not present any difficulty with respect to the 
decentralization scheme, or the implementation of 
ready list accesses. With this assumption, the 
birth-death formulation for the local queue ($X_j$ 
for ready list j, j = 1 to m) is as follows: 

$$\lambda_x = \lambda (n-x)(1/m) \quad \text{for } x = 0 \text{ to } n \quad \text{(discouraged arrivals)}$$

$$\mu_x = \begin{cases} 0 & \text{for } x = 0 \\ \mu & \text{for } x = 1 \text{ to } n \end{cases} \quad \text{(steady service)}$$

where x denotes the number of processors engaged 
in accessing $L_j$. 

This birth-death formulation does not account 
for all the self-regulation. If i is the total 
number of processors queued for all the ready 
lists, then the overall discouraged arrival rate 
for this formulation is: 

$$\sum_{j=1}^{m} (n - x_j)(\lambda/m) = \lambda\left(n - \frac{1}{m}\sum_{j=1}^{m} x_j\right) = \lambda (n - i/m)$$

which is greater than the "true" overall discouraged 
arrival rate of λ(n-i). For this reason, 
the approximation based on the above birth-death 
formulation slightly overestimates E(N), and 
underestimates E(K) = n - E(N). In other words, 
the assumption where the processors rejoin local 
queues at mean intervals 1/λ guarantees that the 
throughput results based on this approximation are 
not overly optimistic. 


Carrying out the single-server analysis, we 
first obtain the mean local queue: 

$$E(X_j) = n - \frac{m}{\rho}\,\frac{G(n-1)}{G(n)}$$

The approximation for E(K) is then: 

$$E(K) = n - \sum_{j=1}^{m} E(X_j) = \frac{m^2}{\rho}\,\frac{G(n-1)}{G(n)} - (m-1)n$$

where G(·) is the cumulative d.f. for 
Poisson with mean m/ρ. 

The results of the Random-Routing Model for the 
probabilistic case are shown in Figure-4 and compared 
with the best-case and worst-case results for E(K). 

It is clear that, even for a heavy ready list 
access traffic Ca=sh)s the results of the random- 
access policy are remarkably close to the best case, 
which corresponds to the ideal speedup factor. In 
(1), these results have been compared to the 
approximation with "mean think time" suggested in 
(8) and found to be very close. 


6.2 Concluding Comments on Proposed Policy 


When we employ m = n ready lists with the random-access 
policy, speedup factors very close to n are 
achievable. If a is very high and the results are 
not very close to the best case, we can employ a larger 
number (m > n) of ready lists to bring the throughput 
results closer to the best case. Table-2 shows 
this for a = .25. It must be kept in mind that 
the approximation is guaranteed not to overestimate 
E(K), as mentioned earlier. 


The random-access policy is a viable method 
for avoiding centralized mechanisms and the result- 
ing performance bottleneck conditions. For the 
MIMD system where the number of processors may be 
very large, avoidance of bottleneck conditions is 
an important performance issue. 


               E(K) = Expected Number of Processors
Number of      in Normal Execution With                 Best-
Processors     Randomized-Access Policy                 Case
   (n)         m=n         m=n+5        m=n+10          E(K)

    5          3.56        3.66         3.69            3.75
   10          6.93        [illegible]  7.23            7.50
   15         10.27       10.55         [illegible]    11.25
   20         13.61       13.94        14.14           15.00
   25         16.95        [illegible] 17.55           18.75
   30         20.28       20.67        20.94           22.50

Table-2: Throughput Results for the Proposed Policy 
         (m ≥ n, a = .25). 


On the other hand, the Random-Routing Model 
takes into account only the ready list accesses. 
Other PMS system tables are not in the decentrali- 
zation scheme. Other factors, such as what happens 
when a ready list is found empty, are not considered. 
A simulation model which takes these factors into 
account and uses the proposed ready list access 
policy is discussed in (1). The results of that 
simulation are also very encouraging for the proposed 
randomized-access policy. 

Figure-e: Effective Throughput Results for 
Centralized PMS Model. 


ASSIGNING PROCESSES TO PROCESSORS IN DISTRIBUTED SYSTEMS 

Abstract A message-based approach to interprocess 
communication is widely accepted for distributed computing. We 
present objectives necessary for assigning processes to processors in 
a distributed environment. Two objectives have previously been 
identified; neither has considered actual message delays. We give 
two new objectives and show how all four objectives are related to 
actual message delays. The importance of these objectives is 
illustrated by a realistic example from numerical analysis. This 
example was run on a testbed consisting of a compiler and 
simulator used to run CSP-like programs on user-specified 
architectures. 


1. Introduction 


When the processes of a distributed program are assigned to 
processors in a distributed environment, heuristic algorithms have 
traditionally considered only minimizing the communication among 
processors and balancing the load over the processors [2, 3, 5, 6, 8]. 
We have found two new objectives for the assignment problem, and 
shown how all four conflicting objectives are interrelated. Our 
studies have shown that actual message delay is an important 
consideration. We have developed a heuristic algorithm for 
assigning processes to processors which incorporates these 
objectives; this algorithm is presented in [9]. In this paper we 
motivate why these four objectives are important and present an 
example to illustrate the varying importance of these objectives. 


We assume that the processes of a distributed program can 
usually execute at the same time. The objectives are for a set of 
processes where there is minimal ordering on the processes. Thus 
in a program where the processes are strongly ordered, the 
objectives may be applied to each subset of processes where the 
processes in each subset can usually execute at the same time. 


2. Message Delay in a Distributed 
Environment 


We first describe our distributed system of processors. We 
include the physical characteristics of the distributed architecture 
by considering processors with different speeds, and lines with 
different capacities and lengths that connect the processors. We 
assume that any processor can communicate with any other 
processor by routing messages through intermediate processors over 
fixed paths. We define virtual line time for a message between 
two processors connected directly by a line as a function of lower 
level protocols, message length in message units, number of bits per 
message unit, line capacity, and line length. Virtual line time does 
not include the time a message waits to use the communication 
subnet. Virtual line time for a message between two processors is 
the sum of the virtual line times for the lines on the route. 
Currently in local area networks, lower level protocols executing in 
the processors usually reduce the physical line capacity by at least 
a factor of ten for any message [1]. Virtual line time reflects this 
effective line capacity. 

The message delay of a process for a blocking communication 
as in CSP [7] is a function of virtual line time, queueing at the port 
queues on the route in a store and forward network, and the 
processing, waiting, and queueing time of the corresponding process 
at its processor. Message delays can be very large compared to a 
process’s processing time between communications. 


The following limiting conditions generally hold when trying to 
minimize the total time a program requires. 


1. As all message delays -> 0, the optimal assignment is 
to cluster and assign processes such that the load is 
balanced over all processors. 


2. As all message delays -> infinity, the optimal 
assignment is to assign all processes to the fastest 
available processor. 


When message delay is considered it is important to realize that it 
may not be advisable to use all the processors. 


To illustrate we give a simple example. Consider two processors 
of equal speed connected by a line, and two communicating 
processes where each executes for time t and then sends a message 
to the other. We must determine for what message delay range 
both processes must be assigned to one processor and for what 
message delay range each must be assigned to separate processors. 
If both processes are on one processor the total time is 2*t. If there 
is a process on each processor then the total time is t+d, where d is 
the time for a message to go from one processor to the other. 
Thus, when t>d it is better to assign each process to a different 
processor; when t<d it is better to assign both processes to one 
processor. Note that the relationship of t to d determines the 
number of processors used. 
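
The comparison in this two-process example can be written out directly. The sketch below is only an illustration; the function name and the handling of the tie t = d (which the text does not address) are assumptions.

```python
def best_placement(t, d):
    """Compare the two placements of the two-process example.

    t: time each process executes before sending its message
    d: time for a message to travel from one processor to the other
    Returns (placement, total_time).
    """
    same_processor = 2 * t        # the two processes run one after the other
    separate_processors = t + d   # they run in parallel, then exchange messages
    if separate_processors < same_processor:      # equivalent to t > d
        return "separate processors", separate_processors
    # t < d; the tie t == d is resolved here arbitrarily in favour of one processor
    return "one processor", same_processor


if __name__ == "__main__":
    print(best_placement(10, 3))    # small message delay
    print(best_placement(10, 25))   # large message delay
```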


3. Objectives 


Minimizing the total time a program runs on a distributed 
architecture is very much dependent on how the processes are 
assigned to processors. Objectives are given for assigning processes 
to processors to reduce the total time a program runs. 


3.1. Minimizing Interprocessor Communication 

It has been established that it is important to minimize the 
communication among processors by effectively partitioning the 
processes of a distributed program and assigning the partitions to 
the processors [2, 3, 5, 6, 8]. There are actually two minimization 
problems in this statement. We can minimize the interprocessor 
communication (1) using a fixed number of processors or (2) letting 
the number of processors vary. In case (2) the minimum is 
obtained by assigning all the processes to a single processor. 


We assume interprocess communication is independent of the 
assignment. To reduce interprocessor communication, processes 
which communicate the most are grouped (clustered) into one 
partition. We call each partition of processes a cluster. 


3.1.1. Minimizing Communication on n Processors 


We assume that if there are n processors, n clusters are formed, 
and each cluster is assigned to a different processor. We consider 
the following two functions to minimize: (a) the sum of the 
communication between each pair of clusters, and (b) the maximum 
communication between each pair of clusters. In a bus or 
broadcast network, interprocessor communication is reduced by 
minimizing (a). In a fully connected network with identical lines, 
interprocessor communication is reduced by minimizing (a) and (b). 
In a store and forward network with fixed paths through 
intermediate processors, interprocessor communication is reduced 
by considering the minimization of both (a) and (b), and the paths 
in the network. The fixed paths need to be considered if a line is 
included in the paths between several different pairs of processors; 
that line can become a bottleneck. 


3.1.2. Minimizing Communication on Variable Number of 
Processors 


As fewer processors are used it may be the case in a store and 
forward network with fixed paths that there is less total 
communication on all lines but more communication on some line. 
For large message delays, queueing delays at the corresponding 
ports can get large. Thus in a store and forward network, fewer 
processors are used only if the communication on any line does not 
increase. 


3.2. Load Balancing 

Load balancing has also been established as an objective [2, 3]. 
Load balancing over all the processors is only important when all 
message delays are small. As message delays go to zero, 
communicating processes can be modeled as noncommunicating 
processes. We know that for noncommunicating processes total 
time is minimized if the load is balanced over all processors. 
This assumes that all processors can stay busy for the entire time. 
At zero virtual line time the delay for a process waiting on a 
message depends on the processing, queueing, and waiting times of 
the corresponding process at its processor. For this reason it may 
not be possible to keep each processor busy. However, load 
balancing over all processors for small message delays is a 
reasonable goal. 


3.3. Minimum Processing Requirement 


As message delays get large, there must be enough processing 
available at a processor so that when one process blocks for a 
message, other processes can execute until the desired message 
arrives. This reduces idle periods at a processor caused when all 
assigned processes are blocked waiting for messages. Thus, a 
processor must be assigned a minimum amount of processing, and 
this processing requirement depends on the message delays. Only 
after there is the minimum amount of processing at each processor 
can loads be balanced. As message delays get large, fewer 
processors are used. This objective establishes in part how many 
processors will be used; this is illustrated by the example given in 
the last paragraph of Section 2. We are not directly minimizing 
interprocessor communication by reducing the number of 
processors. To use this objective both the message delay and 
available processing at a processor before it becomes idle must be 
estimated. 


3.4. Closest Processor 

In a store and forward network with fixed paths, virtual line time 
varies between pairs of processors. Clusters with the largest 
intercluster communication are placed on processors connected by 
the smallest virtual line time for a message unit. In a broadcast 
network or a fully connected network with identical lines, 
processors are equally close and thus this objective is not necessary. 


4. Example 


An example is now presented to illustrate the objectives given 
above. We solve Laplace’s partial differential equation (PDE), 
$u_{xx} + u_{yy} = 0$, on a grid with the outer edges of the grid given as 
boundary conditions. This example was run on a testbed consisting 
of a compiler and simulator. It runs a CSP-like program on a 
distributed architecture specified by the user and includes the 
characteristics described in Section 2 [9]. The testbed was 
validated extensively using commercial analytical and simulation 
packages. The testbed provides confidence interval estimates at 
the 90% level with relative widths less than .05 for various 
performance measures. We have reported only the midpoint of the 
confidence interval for the measure, total time. 


4.1. Algorithm for Example 


The iterative method used is Gauss-Seidel. The grid is 
partitioned into subgrids where each subgrid is some number of 
contiguous rows. Each subgrid is solved by a process in the same 
way a sequential program would solve the entire grid. A grid value 
is computed as the average of its four adjacent neighbors; thus, to 
compute a row of values, the two adjacent rows are required. 
Hence, a process must request the two rows contiguous to its 
subgrid from its two neighboring processes. 


The usual way of executing the n processes is in pipeline fashion 
where if process n is on the kth step, process 1 is on the (k+n-1)th step. 
However, in our algorithm all processes begin at the same time and 
assume the requested rows are zero initially. In our algorithm each 
process is computing in the ith or (i+1)st process iteration at any 
given time. The algorithm converges because the Gauss-Seidel 
method converges for any set of starting values; the first j-1 
iterations for process j compute a better set of starting values for 
its subgrid. 


The entire grid must be calculated at each iteration as long as at 
least one grid value has not met the convergence criteria. The 
communication structure is linear; thus, convergence over the entire 
grid is easy to detect and termination of all processes is easy to 
ensure. 
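
The row-strip decomposition described in this section can be sketched as follows. This is a sequential emulation only, under stated assumptions: the strips are swept one after another in a single address space, so it does not model the message exchanges, the concurrency, or the zero-initialized requested rows of the CSP-like program; all function names are hypothetical.

```python
def split_rows(n_interior_rows, n_procs):
    """Partition interior rows 1..n_interior_rows into contiguous strips."""
    base, extra = divmod(n_interior_rows, n_procs)
    strips, start = [], 1
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        strips.append((start, start + size))   # half-open row range
        start += size
    return strips


def sweep_strip(grid, row_start, row_end):
    """One in-place Gauss-Seidel sweep over rows [row_start, row_end).

    Each value becomes the average of its four neighbours, so the rows just
    above and below the strip are read as well. Returns the largest change.
    """
    max_change = 0.0
    for i in range(row_start, row_end):
        for j in range(1, len(grid[i]) - 1):
            new = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                          + grid[i][j - 1] + grid[i][j + 1])
            max_change = max(max_change, abs(new - grid[i][j]))
            grid[i][j] = new
    return max_change


def solve(grid, n_procs, tol=1e-8, max_iters=100_000):
    """Iterate until no grid value changes by tol or more; return iteration count."""
    strips = split_rows(len(grid) - 2, n_procs)
    for iteration in range(1, max_iters + 1):
        change = max(sweep_strip(grid, a, b) for a, b in strips)
        if change < tol:
            return iteration
    return max_iters


if __name__ == "__main__":
    size = 8
    grid = [[1.0 if i in (0, size - 1) or j in (0, size - 1) else 0.0
             for j in range(size)] for i in range(size)]
    print("iterations:", solve(grid, n_procs=3))
```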


4.2. Analysis of Example 


The above algorithm was run on the testbed. Figure 6-1 shows 
the communication structure of the algorithm with each process 
represented by a circle (the process number in the circle), the total 
processing time requirement per process below each circle, and the 
amount of communication exchanged between two communicating 
processes above each line. Values for communication and 
processing time are obtained by running the program on the 
testbed with any assignment and architecture; for this program 
these quantities are independent of the architecture and 
assignment. 


The distributed architecture is three fully connected processors. 
Each processor has the same speed and each line is identical. The 
queueing discipline at each processor is preemptive priority with 
highest priority given to those processes which communicate across 
a line for the given assignment. We found that as message delays 
increased it was important to use the preemptive priority discipline 
and give priority to those processes which had to communicate 
over a physical line [9]. 


We studied eight different assignments. The assignments with 
the total processing requirement at each processor and the 
communication on each line are given in Table 6-1. The total 
processing requirement at a processor is the sum of the processing 
requirement of each process assigned to the processor. Assignments 
RR, A0, and A1 use all three processors; assignments A2, A3, A4, 
and A5 use two processors; and assignment A6 uses only one 
processor. Assignment A0 has minimum variance of the total 
processing requirement at the three processors, and thus better load 
balancing than the other assignments. 
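
The load-balance claim can be checked from the per-processor loads listed in Table 6-1. The sketch below simply computes the population variance of the three loads for each assignment; the dictionary layout and names are illustrative choices, and the load values are transcribed from the table.

```python
from statistics import pvariance

# Total processing requirement at processors 1, 2, 3 (from Table 6-1).
LOADS = {
    "RR": (23272, 18738, 19083),
    "A0": (22927, 19083, 19083),
    "A1": (23987, 19083, 18023),
    "A2": (30857, 30236, 0),
    "A3": (23987, 37106, 0),
    "A4": (18023, 43070, 0),
    "A5": (5249, 55844, 0),
    "A6": (61093, 0, 0),
}


def load_variance(assignment):
    """Population variance of the load over the three processors."""
    return pvariance(LOADS[assignment])


def best_balanced():
    """Name of the assignment with the smallest load variance."""
    return min(LOADS, key=load_variance)


if __name__ == "__main__":
    for name in LOADS:
        print(f"{name}: variance = {load_variance(name):.0f}")
    print("best balanced:", best_balanced())
```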


Table 6-2 gives the total time for the PDE program to complete 
execution for each assignment with several virtual line times for a 
message unit. The notation (1,2,3/4,5,6,7/x) denotes that processes 
1, 2, and 3 are assigned to processor 1; processes 4 through 7 are 
assigned to processor 2; and no processes are on processor 3. 


We now show how the objectives vary as message delays (virtual 
line time) increase. At virtual line time 57.9, the better load 
balanced assignments, A0 and A1, have minimum total time. 
Minimizing interprocessor communication is not important. At 
virtual line time 592.2, the two processor assignments A3 and A4 
are better assignments than any one or three processor assignment. 
This illustrates that there must be a minimum amount of 
processing at a processor since they are preferable to any three 
processor assignment. At this virtual line time A2 compared to the 
other two processor assignments demonstrates that it is important 
to minimize the communication between the processors. At virtual 
line time 1388.6 it is better to put all processes on one processor; 
this is demonstrated by a performance improvement of greater 
than a factor of three when RR and A6 are compared. Looking at 
the entire Table 6-2, the trend to use fewer processors as message 
delays increase is shown by a * which marks the best assignment 
for each virtual line time. 
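
The trend can be reproduced from the total times in Table 6-2. The sketch below picks the assignment with minimum total time for each virtual line time; the names are illustrative, the values are transcribed from the table, and the RR row is omitted because it is incomplete in this copy of the table.

```python
VIRTUAL_LINE_TIMES = (57.9, 592.2, 871.7, 1388.6)

# Total execution time per assignment, one entry per virtual line time
# (from Table 6-2; RR omitted because its row is incomplete).
TOTAL_TIME = {
    "A0": (24836, 61297, 84156, 126548),
    "A1": (24650, 55738, 72119, 102185),
    "A2": (33271, 82869, 107460, 155273),
    "A3": (38357, 47990, 65405, 97960),
    "A4": (45032, 51085, 66916, 98358),
    "A5": (59069, 58951, 60613, 92485),
    "A6": (63636, 63636, 63636, 63636),
}

# Number of processors each assignment actually uses (from Table 6-1).
PROCESSORS_USED = {"A0": 3, "A1": 3, "A2": 2, "A3": 2, "A4": 2, "A5": 2, "A6": 1}


def best_assignments():
    """Assignment with minimum total time for each virtual line time."""
    return [min(TOTAL_TIME, key=lambda name: TOTAL_TIME[name][col])
            for col in range(len(VIRTUAL_LINE_TIMES))]


if __name__ == "__main__":
    for line_time, name in zip(VIRTUAL_LINE_TIMES, best_assignments()):
        print(f"{line_time:>7}: {name} ({PROCESSORS_USED[name]} processors)")
```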


5. Conclusion 


In this paper we have presented four important objectives to 
consider when assigning processes to processors in a distributed 
environment. Two objectives are new. We have shown how all 
four objectives are affected by actual message delays. We have 
presented the results for one distributed program on architectures 
with various virtual line times; these results were obtained from a 
testbed for distributed computing. 

Figure 6-1: Communication Structure and Processing Time 
            Requirement of each Process of PDE Program 


                            processor number                  exchanges
assignment          1               2              3          on lines

RR: processes     1,4,7            2,5            3,6         80,80,80
    load          23272           18738          19083
A0: processes     1,2,7            3,4            5,6         40,40,40
    load          22927           19083          19083
A1: processes     1,2,3            4,5            6,7         40,40,0
    load          23987           19083          18023
A2: processes     4,5,6           1,2,3,7         empty       80,0,0
    load          30857           30236              0
A3: processes     1,2,3           4,5,6,7         empty       40,0,0
    load          23987           37106              0
A4: processes     6,7             1,2,3,4,5       empty       40,0,0
    load          18023           43070              0
A5: processes     1               2,3,4,5,6,7     empty       40,0,0
    load           5249           55844              0
A6: processes     1,2,3,4,5,6,7   empty           empty       0,0,0
    load          61093               0              0


Table 6-1: Assignments, Resulting Load at each Processor, 
           and Message Exchanges per Line for PDE 


assignment                        virtual line time
                         57.9      592.2      871.7     1388.6

RR(1,4,7/2,5/3,6)   : 99726 215993
A0(1,2,7/3,4/5,6)   :   24836      61297      84156     126548
A1(1,2,3/4,5/6,7)   :   24650*+    55738      72119     102185
A2(4,5,6/1,2,3,7/x) :   33271      82869     107460     155273
A3(1,2,3/4,5,6,7/x) :   38357      47990*     65405      97960
A4(6,7/1,2,3,4,5/x) :   45032      51085+     66916      98358
A5(1/2,3,4,5,6,7/x) :   59069      58951      60613*     92485
A6(all on 1 pr.)    :   63636      63636      63636+     63636*+


+ assignment generated by heuristic 

* assignment with minimum total time 


Table 6-2: Total Execution Times for Each Assignment 
           for PDE 

Abstract -- Parallel processing systems, such as 
PASM, employ a large number of primary memory 
modules. A memory system organization using parallel 
secondary storage devices and double-buffered primary 
memories has been devised for PASM in order to prevent 
primary/secondary memory transfers from becoming a 
bottleneck. To efficiently use the memory system, it is 
desirable to overlap the operation of the parallel secon- 
dary storage devices with computations being performed 
by the processors. Due to the dynamically reconfigurable 
architecture of PASM, the processors which will execute 
a new task will not be selected until they are ready to 
execute the task. That is, to make effective use of 
double-buffering, a task must be preloaded prior to the 
final selection of the processors on which it will execute. 
Two schemes which allow for the parallel secondary 
storage devices to preload input data and programs into 
the primary memories so that system performance can 
be improved are presented and compared. Results show 
that both methods are effective techniques. 


I. Introduction 


In large-scale reconfigurable parallel processing sys- 
tems the transfer of data and programs between the pri- 
mary memories of the processors and the secondary 
storage can become a bottleneck. There are several 
types of reconfigurable parallel processing systems. A 
partitionable SIMD/MIMD system can be dynamically 
reconfigured to operate as one or more independent 
SIMD (single instruction stream-multiple data stream) [4] 
and/or MIMD (multiple instruction stream-multiple data 
stream) [4] machines (e.g., PASM [18], TRAC [8,16]). A 
multiple-SIMD system is a parallel processing system 
which can be dynamically reconfigured to form one or 
more independent SIMD machines of varying sizes (e.g., 
MAP [11,12]). When a partitionable SIMD/MIMD or 
multiple-SIMD system is forming an SIMD machine, data 
must be loaded into the processors’ primary memories. 
When a partitionable SIMD/MIMD system is forming an 
MIMD machine, in addition to data, a program must be 
loaded into the primary memory of each processor which 
is executing the task. 

PASM is a partitionable SIMD/MIMD multimicro- 
computer system being designed at Purdue University 
for image processing and pattern recognition applications 
[18]. In order to prevent the primary/secondary memory 
transfers from becoming a bottleneck in PASM, a 
memory system employing parallel secondary storage 
devices and double-buffered primary memories has been 
devised [18]. To improve processor utilization by taking 
advantage of the double-buffering, it is necessary to overlap 
the operation of the parallel secondary storage devices 
with computations being performed by the processors. 
This overlap can be obtained by preloading the 
programs and data for the next task, while the previous 
task is being executed, and then by overlapping the 
unloading of output data with execution of the next 
task. 

This research was supported by the Air Force Office of 

Since PASM is to be used in a research environment 
for parallel algorithm development (in some cases 
interactively), it is undesirable to require the user to 
specify the maximum allowable execution time of a task 
before it can be executed. The first-fit multiple-queue 
(FFMQ) scheduling algorithm, which has been described 
in [20], does not put this requirement on the user. If the 
FFMQ scheduling algorithm is used, due to the dynamically 
reconfigurable architecture of PASM, it is not 
known a priori which task a given group of processors 
will execute. Since tasks must be preloaded prior to the 
final selection of processors, it appears that the system 
would be unable to preload tasks when using the FFMQ 
scheduling algorithm. The problem considered is how to 
determine where the data and programs for tasks can be 
preloaded while using the FFMQ scheduling algorithm so 
that the performance of the memory system can be 
improved. Without such a preloading scheme, the full 
potential of the double-buffered memory modules will 
not be realized. 

This paper presents preloading schemes which can 
be used in conjunction with the FFMQ scheduling algo- 
rithm. Two schemes which solve the preloading problem 
by determining which task’s or tasks’ programs and 
input data should be preloaded into a given set of pro- 
cessor memories are presented. The first scheme uses the 
scheduling algorithm to preschedule the task(s) which 
will follow the current task. The second scheme uses the 
scheduling algorithm to predict which task(s) may follow 
a given task. The performance of these preloading 
schemes as applied to PASM is demonstrated and con- 
trasted through simulation studies. The preloading 
schemes described can be adapted to other multiple- 
SIMD and partitionable SIMD/MIMD systems. 


II. PASM Background 


PASM, a partitionable SIMD/MIMD machine, is a 
large-scale dynamically reconfigurable parallel processing 
system [18] (see Fig. 1). The System Control Unit is a 
conventional machine, such as a PDP-11, and is responsi- 
ble for the overall coordination of the activities of the 
other components of PASM. The Parallel Computation 
Unit (PCU) contains N = 2^n processors, N memory 
modules, and an interconnection network (see Fig. 2). 
The PCU processors are microprocessors that perform 
the actual SIMD and MIMD computations. The PCU 
memory modules are used by the PCU processors for 
data storage in SIMD mode and both data and instruction 
storage in MIMD mode. A pair of memory units is 
used for each PCU memory module so that data can be 
moved between one memory unit and secondary storage 
while the PCU processor operates on data in the other 
memory unit (double-buffering). A processor and its 
associated memory module form a PCU processing element 
(PE). The PEs are physically addressed from 0 to 
N-1. The pair of memory units forming the ith PCU 
memory module are labeled iA and iB (see Fig. 2). The 
interconnection network provides a means of communication 
among the PEs. PASM will use either a Cube type 
[1] or ADM type [10] of multistage network. 

Fig. 2. PASM Parallel Computation Unit (PCU). 

The Micro Controllers (MCs) are a set of microprocessors 
which act as the control units for the PEs in 
SIMD mode and orchestrate the activities of the PEs in 
MIMD mode. There are Q = 2^q MCs, addressed from 0 
to Q-1. Like the PEs, the MC memory modules are 
double-buffered. Each MC controls N/Q PEs, where 
possible values of N and Q are 1024 and 16, respectively. 
An MC-group is composed of an MC processor, its 
memory module, and the N/Q PEs which are controlled 
by the MC. The N/Q PEs connected to MC i are those 
whose addresses have the value i in their low-order q bit 
positions (see Fig. 3). Control Storage contains the programs 
for the MCs. 

A virtual machine of size RN/Q, where R = 2^r and 
0 ≤ r ≤ q, is obtained by combining the efforts of R 
MC-groups. According to the partitioning rule for 
PASM [18], the physical addresses of these MCs must 
have the same low-order q-r bits so that all of the PEs 
in the partition have the same low-order q-r physical 
address bits. For example, for Q = 16, allowable MC 
partitions include: (6), (14), (2,10), (0,4,8,12), and 
(1,3,5,7,9,11,13,15). Q is therefore the maximum number 
of partitions allowable, and N/Q is the size of the smallest 
partition. This particular partitioning rule is used 
because it allows multistage networks like 
the multistage Cube and the ADM, which are being considered 
for PASM, to be partitioned into independent 
subnetworks [17]. This rule is also valid for multistage 
Omega [9], shuffle-exchange [13], and indirect binary n-cube 
[15] networks, as well as other data manipulator [3] 
type networks such as the Gamma [14] network [17]. 

Fig. 3. Organization of the Memory Storage System for 
N = 16 and Q = 4. 

The designator of a virtual machine composed of an 
allowable partition is the smallest physical address of the 
MCs in the virtual machine. This designator 
corresponds to the low-order q-r bits of the physical 
address of each MC in the virtual machine. For the par- 
titions in the above example, the designators are: 6, 14, 
2, 0, and 1, respectively. 
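
The partitioning rule and the designator can be illustrated with a few lines of code. This is a sketch with hypothetical function names, assuming only what is stated above: a partition of R = 2^r MCs consists of the MC addresses that share the same low-order q-r bits, and the designator is the smallest of those addresses.

```python
def mc_partition(designator, r, q):
    """MC addresses of the virtual machine with the given designator.

    The partition holds 2**r of the 2**q MCs; all members share the
    low-order q - r address bits, whose value is the designator.
    """
    if not 0 <= r <= q:
        raise ValueError("r must satisfy 0 <= r <= q")
    step = 1 << (q - r)                 # spacing between member addresses
    if not 0 <= designator < step:
        raise ValueError("designator must fit in the low-order q - r bits")
    return [designator + i * step for i in range(1 << r)]


def is_allowable(mcs, q):
    """True if the given MC addresses form an allowable partition."""
    members = sorted(set(mcs))
    size = len(members)
    if size == 0 or size != len(mcs) or size & (size - 1) or size > (1 << q):
        return False                    # empty, duplicates, or not a power of two
    r = size.bit_length() - 1
    designator = members[0]
    if designator >= 1 << (q - r):
        return False
    return members == mc_partition(designator, r, q)


if __name__ == "__main__":
    for group in ([6], [14], [2, 10], [0, 4, 8, 12], [1, 3, 5, 7, 9, 11, 13, 15]):
        print(group, "designator", min(group), "allowable:", is_allowable(group, 4))
```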

The approach of permanently assigning a fixed 
number of PCU processors to each MC has the advan- 
tages that the operating system need only schedule Q 
MCs, rather than N PCU processors, and that it 
simplifies the MC/PE interaction, from both a hardware 
and software point of view, when a virtual machine is 
being formed. In addition, this fixed assignment scheme 
is exploited in the design of the Memory Storage System 
in order to allow the effective use of parallel secondary 
storage devices [18]. 

The FFMQ scheduling algorithm, which is being
considered for use with PASM [20], makes use of q + 1
first-in first-out task queues, TQ_0, TQ_1, ..., TQ_q. A task
which requires 2^k MC-groups is put into TQ_k. Whenever
there are free MC-groups, the FFMQ algorithm selects
the first job in TQ_k, where k is the largest integer such
that a virtual machine of size 2^k MCs is available for
execution. If TQ_k is empty, then the first task from
TQ_(k-1) is selected. This process is continued until all
available MCs have been assigned or until k = 0. The
FFMQ algorithm assigns the task to the free virtual
machine with the lowest designator. This is a
nonpreemptive, multiple-queue scheduling policy, since
all tasks run until completion. The FFMQ algorithm is
a centralized scheduling algorithm [7], since the System
Control Unit, which is executing the FFMQ algorithm,
has complete and accurate information regarding the
states of all tasks in the system.
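
A minimal sketch of one scheduling pass, as read from the description above, is given below. It is not the PASM Operating System code: the names `ffmq_assign` and `lowest_free_partition`, and the representation of free MCs as a Python set, are assumptions made for illustration. The pass is taken to run down through the queue for k = 0.

```python
from collections import deque


def lowest_free_partition(q, k, free):
    """Free partition of 2**k MCs with the lowest designator, or None."""
    step = 1 << (q - k)
    for d in range(step):  # designators in increasing order
        mcs = [d + j * step for j in range(1 << k)]
        if all(m in free for m in mcs):
            return mcs
    return None


def ffmq_assign(q, queues, free):
    """One scheduling pass. queues[k] is a FIFO (deque) of tasks needing
    2**k MC-groups; free is the set of idle MC numbers (modified in place)."""
    assignments = []
    k = q
    while k >= 0 and free:
        mcs = lowest_free_partition(q, k, free) if queues[k] else None
        if mcs is None:  # queue empty or no machine of this size is free
            k -= 1
            continue
        task = queues[k].popleft()
        free.difference_update(mcs)
        assignments.append((task, mcs))
    return assignments
```

With Q = 4, an idle machine, one task in TQ_1 and one in TQ_0, the pass assigns the two-MC-group task to MCs (0,2) and the one-MC-group task to MC 1.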

When PASM is forming a virtual machine which is 
to execute an SIMD task, data must be loaded into the 
PCU memory units and a program must be loaded into 
the MC memory units. When forming a virtual machine 
which is to execute an MIMD task, both data and pro- 
grams must be loaded into each of the PCU memory 
units. In this paper the loading/unloading of data for 
SIMD tasks from the PCU memory units is considered. 
The loading of the SIMD program into the MC memory 
units is not considered since it can be overlapped with 
the loading of the data, following the same preloading 
scheme. The analysis in this paper can easily be 
extended to MIMD tasks; instead of loading just data, 
both programs and data would be loaded. 

The Memory Storage System, which provides secon- 
dary storage space for the PCU memory modules, con- 
sists of N/Q independent Memory Storage Units (MSUs). 
It is controlled by the Memory Management System. 
The MSUs are numbered from 0 to (N/Q)-1. Each is
connected to Q PCU memory modules, as shown in Fig.
3. For 0 ≤ i < N/Q, MSU i is connected to those PCU
memory modules whose physical addresses are of the
form (Q * i) + k, 0 ≤ k < Q. For 0 ≤ k < Q, MC-group k
contains those PCU processors whose physical
addresses are of the form (Q * i) + k, 0 ≤ i < N/Q.
Thus, MSU i is connected to the i-th PE of each
MC-group.
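
The addressing relation between PEs, MSUs, and MC-groups can be checked with a few lines of code. The helper names are hypothetical; only the form (Q * i) + k is taken from the text.

```python
def pes_on_msu(i, Q):
    """Physical addresses of the PCU memory modules connected to MSU i."""
    return [Q * i + k for k in range(Q)]


def pes_in_mc_group(k, N, Q):
    """Physical addresses of the PCU processors in MC-group k."""
    return [Q * i + k for i in range(N // Q)]


def msu_and_group(pe, Q):
    """(MSU number, MC-group number) for a physical PE address."""
    return pe // Q, pe % Q
```

For N = 16 and Q = 4 (the case of Fig. 3), MSU 1 serves PEs 4 through 7 and MC-group 2 contains PEs 2, 6, 10, and 14, so each MSU shares exactly one PE with each MC-group.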

The two main advantages of this approach for a 
partition of size N/Q (i.e., one MC-group) are that (1) all 
of the PCU memory modules can be loaded in parallel 
and (2) the data is directly available no matter which 
partition (MC-group) is chosen. This is done by storing
the data for a task which is to be loaded into the i-th
logical PE of the virtual machine in MSU i, 0 ≤ i < N/Q.
Thus, no matter which MC-group is chosen, the data
from the i-th MSU can be loaded into the i-th PCU
memory module of the virtual machine, for all i,
0 ≤ i < N/Q, simultaneously.

Thus, for virtual machines of size N/Q, this secon- 
dary storage scheme allows all N/Q PCU memory 
modules to be loaded in one parallel block transfer. 
Consider the situation where a virtual machine of size 
RN/Q is desired, 1 < R < Q. Only R parallel block 
loads are required if the data for the PCU memory 
module whose high-order n-q logical address bits equal i
is stored in MSU i. This is true no matter which
partition of R MCs (which agree in the low-order q-r
address bits) is chosen [18].
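
The count of R parallel block loads can be reproduced with a small sketch. It assumes that a virtual machine of RN/Q PEs uses logical addresses of n - q + r bits, so that the high-order n-q bits are obtained by shifting right by r; this width is inferred from the machine size rather than stated in the text, and the function names are hypothetical.

```python
from collections import Counter


def msu_for_logical_pe(logical_pe, r):
    """MSU holding the data for a logical PE: its high-order n-q bits."""
    return logical_pe >> r


def parallel_block_loads(N, Q, r):
    """Number of sequential parallel transfers needed to load a virtual
    machine of R = 2**r MC-groups, given one transfer per MSU at a time."""
    R = 1 << r
    per_msu = Counter(msu_for_logical_pe(p, r) for p in range(R * N // Q))
    return max(per_msu.values())
```

Under these assumptions every MSU holds exactly R data items for the task, so the number of transfers equals R regardless of which allowable partition is chosen.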

A memory frame is the amount of space used in the 
PCU memory units for storage of data from secondary 
storage for a particular task. It is possible that a task 
may need to process more than one memory frame. 
Besides being used for preloading, the double-buffered 
PCU memory modules can also be used to overlap task 
execution on one memory frame with the loading or 
unloading of another memory frame. 


III. Memory System Model


In this section the model for the Memory System 
used for the analysis in this paper is described. A
memory unit set is the set of PCU memory units within
a single MC-group which have the same label (e.g., A, 
see Fig. 2). A data block consists of all the data to be 
loaded for one memory unit set. For a particular task 
which requires R MC-groups, there are R data blocks in 
a memory frame. 

When a task is assigned to an MC-group, one of the 
memory units of each PCU memory module within the 
MC-group is used by the task. Without loss of general- 
ity for SIMD tasks, it is assumed that all of the memory 
units within an MC-group which are used by a given 
task are in the same memory unit set. Hence, all of the 
memory units within a memory unit set will always be 
assigned to the same task and will have the same status. 
Since all memory units within a memory unit set always 
have the same status, in this model it is also assumed 
that the loading/unloading of data for a memory unit set 
is done simultaneously and is considered as one action. 
In general, this is also true for MIMD tasks. (However,
for MIMD tasks in which the PEs have differing
secondary memory system requirements, these
assumptions may not hold.)

All requests which are made to the Memory 
Management System and serviced by the Memory 
Storage System are for one data block. This results from 
the fact that the Memory Storage System can only 
load/unload one memory unit set at a time. There are 
three types of requests: load, preload, and unload. A load 
request is a request for input data for a task which has 
been assigned to its MC-group(s) (i.e., the MC-groups are 
ready to execute the task). A preload request is a 
request for input data for a task which has not been 
assigned to its MC-group(s) (i.e., the MC-groups are not 
ready to execute the task). An unload request is a 
request to unload output data for a completed task. 

Load requests have the highest priority since the 
MC-group which is associated with the request has 
already been assigned to the task and is idle waiting for 
its input data. Unload requests have the second highest 
priority since they are for tasks which have already com- 
pleted execution and the user is waiting for the output 
data. Preload requests have the lowest priority since the 
MC-group which is associated with the request is not idle 
waiting for the input data to be loaded. The Memory 
Storage System services requests from three request 
queues, one for each type of request. 
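
The three request queues and their priority order can be sketched as follows. The class name `RequestQueues` and its methods are hypothetical; the sketch only mirrors the ordering described above (load, then unload, then preload), with first-in first-out order within each queue taken as an assumption.

```python
from collections import deque

# Highest priority first, as described in the text.
PRIORITY = ("load", "unload", "preload")


class RequestQueues:
    """One FIFO queue per request type; each request is for one data block."""

    def __init__(self):
        self.queues = {kind: deque() for kind in PRIORITY}

    def submit(self, kind, data_block):
        self.queues[kind].append(data_block)

    def next_request(self):
        """Return (kind, data_block) for the next request to service, or None."""
        for kind in PRIORITY:
            if self.queues[kind]:
                return kind, self.queues[kind].popleft()
        return None
```

A preload request submitted first is therefore still serviced after any load or unload requests that are pending at the time the Memory Storage System becomes free.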


IV. Preloading Schemes 


Preloading enables the Memory Storage System to 
preload the data into the PCU memory units of a given 
virtual machine while the PCU processors of that virtual 
machine are still executing the previous task. In this 
section two task preloading schemes for use with the 
FFMQ scheduling algorithm are presented. These 
schemes make use of the double-buffered PCU memory 
modules. While a task is being executed using one of 
the memory unit sets of a given MC-group (e.g., the A 
memory units), the next task to be executed can be 
preloaded into the other memory unit set (e.g., the B 
memory units). Since there are only two memory units 
associated with each PCU processor, each processor can 
have at most two tasks associated with it. Hence, only 
single task look-ahead preloading is considered. In gen- 
eral, there will be more than one task preloaded into the 
system since different MC-groups can have different 
tasks preloaded. 

Preloading is driven by the size of a currently
executing task. When a task of size 2^k MCs, 0 ≤ k ≤ q,
begins to execute, the preloading of tasks of size 2^k (or
smaller) is considered for that set of MCs. Thus, a task
of size 2^j MCs can be preloaded only if there are tasks of
size 2^j or greater currently executing. The two
preloading schemes to be presented are prescheduling and
prediction.
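
The size constraint on preloading amounts to a one-line test, sketched here with a hypothetical function name. Sizes are given in MC-groups.

```python
def can_preload(task_size, executing_sizes):
    """A task can be preloaded only behind a currently executing task
    that is at least as large (sizes in MC-groups)."""
    return any(task_size <= size for size in executing_sizes)
```

For instance, a four-MC-group task cannot be preloaded while only two-MC-group tasks are executing, which is one reason the average load delay can never reach zero.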

Prescheduling. When prescheduling is used, the task
manager attempts to schedule tasks in advance of when 
they would normally be scheduled. Whenever a task 
starts executing on a virtual machine, the prescheduler 
determines which tasks (if any) will follow the execution 
of the given task. If there are any tasks, the tasks are 
preassigned to the appropriate MCs to be their next task 
executed. The prescheduling algorithm uses the FFMQ 
scheduling algorithm [20]; but, instead of attempting to 
schedule tasks for the entire machine, the prescheduling 
algorithm attempts only to schedule the virtual machine 
(or MC-groups) which is executing the given task. When 
a task completes execution, if no tasks have been 
prescheduled to follow the completing task, the regular 
FFMQ scheduling algorithm is called. Note that task
prescheduling supplements task scheduling; it does
not replace it.

The following example will illustrate the use of
prescheduling on a PASM with four MC-groups (i.e.,
Q = 4). The status of the system is given as a function
of time in Fig. 4. The status of the task queues is given
whenever there is a change. At time 10.0 the system
completes execution of task α, which required four
MC-groups. Task β has already been preloaded, so it
begins executing immediately. The prescheduling
algorithm is called by the task manager to preschedule the
task or tasks which will follow task β. The FFMQ
algorithm determines that tasks γ and δ (which each require
two MC-groups) will follow β. Tasks γ and δ are
removed from TQ_1 and are preassigned to the
appropriate MCs. The task manager then requests that the data
for tasks γ and δ be preloaded. The Memory Storage
System unloads the output data from task α and
preloads the data for γ and δ. Recall that the Memory
Storage System is only able to transfer the data for one
MC-group at a time. Hence it takes four transfers to
unload task α, two to preload γ, and two to preload δ.
At time 10.8 task β is executing and tasks γ and δ are
preloaded and ready to be executed. At time 12.0 task β
completes executing and tasks γ and δ start executing.

As indicated in Fig. 4, task ε, which requires two
MC-groups, was prescheduled to follow task γ. So no
matter when task γ completes execution, task ε will
follow it. As a result, even though task ε arrived at the
system before task ζ, the MC-groups did not start
executing it until 5.5 seconds after task ζ. Consequently, the
response time for task ε is much greater than that of
task ζ. This is an example of how the prescheduling
scheme alters the order in which tasks are executed.

Prediction. As with prescheduling, the prediction
preloading algorithm is invoked each time a task starts
executing. When prediction is used, the task manager
predicts which task or tasks may follow the task which 
started executing. Unlike the prescheduling scheme, the 
task(s) are not removed from the task queue. The pred- 
iction scheme uses the FFMQ scheduling algorithm to 
predict which tasks may follow a given task. The 
predicted task or tasks are then preloaded by the 
Memory Storage System into their predicted memory 
units. The same enqueued task may be predicted and 
preloaded to follow each currently executing task whose 
size is equal to or greater than that of the enqueued 
task. The FFMQ scheduling algorithm is executed 
whenever a task completes execution (i.e., when MCs
become free) and whenever a new task arrives to be 
scheduled [20]. When a task is scheduled for execution 
by the FFMQ scheduling algorithm, the assignment of 
tasks to MC-groups is made as if no preloading had 
taken place. When a task is assigned, the task manager 
sends requests to the Memory Management System indi- 
cating that the Memory Storage System should load the 
data. Recall that one data block is sent to each MC- 
group and that the task manager sends a separate 
request for each data block. After the Memory 
Management System receives the data block requests, it 
voids any requests for data blocks which have been 
preloaded. Doing data requests on a block by block 
basis allows the system to take advantage of and to 
account for partial preloading. Having the task manager 
request all data blocks (regardless of whether they have
been preloaded) removes the burden of keeping track of 
preloaded data from the task manager (which executes 
on the System Control Unit). 
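
The voiding of requests for data blocks which are already preloaded can be sketched as a simple filter. The function name and the dictionary `preloaded`, which maps an MC-group to the task whose data block sits in its idle memory unit set, are assumptions made for illustration.

```python
def loads_still_needed(task, assigned_groups, preloaded):
    """MC-groups whose data block must still be loaded for `task`.

    The task manager requests one data block per assigned MC-group; a
    request is voided when that group already holds the task's block."""
    return [g for g in assigned_groups if preloaded.get(g) != task]
```

If a two-MC-group task is assigned to MC-groups 1 and 3 but was preloaded only into MC-group 1, a single load request for MC-group 3 remains, which is how partial preloading is accounted for.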

Fig. 4. An example of the use of the prescheduling scheme for determining where input data
should be preloaded. The status of a PASM with four MC-groups (Q = 4) is shown.
The status of the eight (2Q) memory unit sets, the three (q + 1) task queues (for
scheduling), and the Memory Storage System is given. Shaded area indicates when a memory
unit set is being accessed (either by loading, unloading, or preloading) by the Memory
Storage System. “L,” “P,” and “U” indicate that the Memory Storage System is
loading input data, preloading input data, and unloading output data, respectively.

Fig. 5. An example of the use of the prediction scheme for determining where input data
should be preloaded. Notation is the same as used in Fig. 4.

The following example will illustrate the use of
prediction on a PASM with four MC-groups (i.e.,
Q = 4). The status of the system is given as a function
of time in Fig. 5. The status of the task queues is
given whenever there is a change. At time 10.0 the
system completes execution of task α, which required four
MC-groups. The scheduler determines that task β will
be executed next by the system. Since task β has been
preloaded, execution begins immediately. The prediction
algorithm is called by the task manager to predict which
task or tasks may follow task β. The FFMQ algorithm
determines that tasks γ and δ (which each require two
MC-groups) may follow β. The task manager then
requests that the data for tasks γ and δ be preloaded.
Note that tasks γ and δ are not removed from the
scheduler task queues as they were for the prescheduling
scheme. The Memory Storage System unloads the
output data from task α and preloads the input data for γ
and δ. Recall that it takes four transfers to unload task
α, two to preload γ, and two to preload δ. At time 10.8
task β is executing and tasks γ and δ are preloaded and
ready to be executed. At time 12.0 task β completes
execution. Since there are free MC-groups, the scheduler
is called by the task manager. The scheduler then
selects tasks γ and δ to be assigned to the free
MC-groups. The tasks are then assigned and begin execution
immediately since they have both been preloaded. Up to
this point, the results of prescheduling and prediction are
the same.

As indicated in Fig. 5, task ε was predicted to follow
either task γ or δ. Therefore, it was preloaded into the
MC-groups forming the virtual machines for both tasks.
In this way, task ε can be executed by the virtual
machine which becomes available first, preserving the
FFMQ ordering. Thus, the normal scheduling policy is
maintained with prediction and task ε does not
experience the excessive delays that it did with prescheduling.
Also note that in this particular example the structure of
the Memory Storage System allows ε to be loaded into
both MC-groups simultaneously. Since task δ was
completed first, task ε was scheduled to follow it, and a new
task was predicted to follow tasks γ and ε.

Summary. With the prediction scheme, the task 
manager predicts where the enqueued task might execute 
and preloads the data into the appropriate PCU memory 
units. The enqueued task may be loaded to follow more 
than one currently executing task. Independently of the 
preloading which has occurred, the scheduler selects 
which task will be executed next and to which MC- 
group(s) the task will be assigned. Thus, prediction does 
not alter the natural order in which tasks would have 
been scheduled without preloading. In contrast, the 
prescheduling scheme has the disadvantage that it alters
the order in which the tasks are executed from the natural
order resulting from the use of the FFMQ scheduling
algorithm. For example, when prescheduling was used,
task ζ was executed before task ε even though task ε was
first in the task queue. As a result, prescheduling greatly
increased the response time for task ε.

The prescheduling scheme has the advantage that it
does not do any unnecessary loading of tasks which may
not be used, i.e., with prescheduling a task is preloaded
(or loaded) only one time. However, with prediction, a
task may be preloaded one or more times. For example,
in Fig. 6, the two MC-group task δ was predicted and
preloaded to follow task γ. Since both tasks α and β
completed execution before γ, task δ did not follow task
γ. Hence, unnecessary loading of task δ occurred.

To see that it is not clear in advance which preloading
algorithm will yield better performance, consider the
simple examples given in Figs. 7 and 8. In the example
in Fig. 7, the system variation using the prescheduling 
scheme completes all of the tasks first. On the other 
hand, in the example in Fig. 8, the system variation 
using the prediction scheme completes all of the tasks 
first. Both preloading schemes have advantages and 
disadvantages. In order to evaluate and quantify their 
relative performance, simulation studies were conducted. 
These studies are described in Section V. 

Fig. 6. Status of a PASM with four MC-groups, with
notation as defined in Fig. 4. Illustrates
unnecessary preloading which can occur from
using the prediction scheme.

Fig. 7. Status of a PASM with two MC-groups, with
notation as defined in Fig. 4. Example of case
where the (a) prescheduling scheme yields better
performance than the (b) prediction scheme.

Fig. 8. Status of a PASM with two MC-groups, with
notation as defined in Fig. 4. Example of case
where the (b) prediction scheme yields better
performance than the (a) prescheduling scheme.

The preloading schemes can use any scheduling
algorithm; they are not limited to the FFMQ scheduling
algorithm. Since the preloading schemes use the
processors currently assigned to a given task, they do not have
to have a fixed MC-group structure. Hence, these
hardware/software schemes can be adapted for use in
other multiple-SIMD and partitionable SIMD/MIMD
systems.


V. Performance Analysis 


A PASM with 16 MCs (Q = 16) and 1024 PEs 
(N = 1024) was simulated using the PASM Operating 
System simulator, a discrete event simulator [5], under 
four variations in the control strategy used by the 
memory system (details of the simulations are given in
[21]).


1. A PASM without double-buffered PCU memory
modules was considered, i.e., only one memory unit
per PE. This allowed for no overlapped loading or
unloading of data, and is examined to demonstrate
the need for the double-buffered PCU memory
modules.

2. A PASM with double-buffered PCU memory modules
was considered, i.e., two memory units per PE. With
this variation the second memory unit was used for
doing overlapped unloading of the output data from
the previous task, but no preloading of the input data
for the next task, i.e., there is no preloading scheme
employed.

3. A PASM with double-buffered PCU memory modules
was considered, using the prescheduling scheme for
determining where to preload input data.

4. A PASM with double-buffered PCU memory modules
was considered, using the prediction scheme for
determining where to preload input data.


Performance measures to be considered are MC
utilization, average load delay time, and average response
time. The MC utilization is the fraction of time that the
MCs are active during the simulation; specifically, the
total MC active time divided by Q and by the total
simulation time. MC utilization has been selected since
the utilization of the MCs reflects the utilization of the
PEs.

The average load delay time is the average delay 
time to load the memory frame for a task (see Fig. 9). 
The load delay time for a given task is the delay 
between the time when the MC-group(s) are ready to 
execute the task and the time when the task starts exe- 
cuting. This is of interest since it directly shows the 
decrease in the time the processors are idle waiting for 
data to be loaded. 

The response time for a task is the delay between 
the time when the task arrives at the system and the 
time when the task completes execution on the system 
(see Fig. 9). The average response time is calculated by 
accumulating the response time for each task executed 
and dividing it by the number of task completions. The 
response time is being considered since a decrease in 
response time has the greatest direct effect on the user. 
Response time is a significant measure since it is 
expected that PASM will often be used interactively. 
Interactive users might be experimenting with different
sequences of image processing algorithms on large
images. It is desirable to be able to vary parameters for
the algorithms and see the results in a reasonable
amount of time (i.e., short response times).

The system throughput is the number of tasks
completed per second by the system. It is not considered in
detail since it is not an accurate performance measure
for this type of analysis. The system throughput does
not take into account the number of MC-groups the
tasks required. For example, for two system variations,
the throughput could be the same, but for one variation
the system could be completing only 16 MC-group tasks
and for the other the system could be completing only one
MC-group tasks. Hence, for the system throughput to
be of use, it is necessary to weight it by the task size,
which is equivalent to looking at the MC utilization.
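
The three performance measures can be computed from per-task timestamps as in the following sketch. The record layout `TaskRecord` and the function names are hypothetical, and MC active time is taken here to mean execution time multiplied by the number of MC-groups used, which is an assumption about how the simulator accumulates it.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    arrival: float    # task arrives at the system
    assigned: float   # MC-group(s) ready to execute the task
    start: float      # task starts executing
    end: float        # task completes execution
    mc_groups: int    # number of MC-groups the task required


def mc_utilization(tasks, Q, total_time):
    """Total MC active time divided by Q and by the total simulation time."""
    active = sum((t.end - t.start) * t.mc_groups for t in tasks)
    return active / (Q * total_time)


def average_load_delay(tasks):
    """Mean delay between MC-groups being ready and execution starting."""
    return sum(t.start - t.assigned for t in tasks) / len(tasks)


def average_response_time(tasks):
    """Mean delay between task arrival and completion of execution."""
    return sum(t.end - t.arrival for t in tasks) / len(tasks)
```

Because the utilization weights each task by the number of MC-groups it occupies, it captures what the raw throughput misses.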


Fig. 9. Time-line which illustrates the definitions of load
delay time and response time. Events marked: task
arrives, task assigned, execution begins, execution
ends, results unloaded. Intervals marked: load delay
time, execution time, response time.

Since each memory unit set is preloaded
independently, it is possible for a task to be partially preloaded
when the previous task completes execution. Therefore,
when the new task is assigned, it is only necessary to
load the memory unit sets which were not preloaded.
The simulator is able to account for partial preloading. 

The value of N (the number of PEs) is not varied
since it would not affect the performance of the system.
For example, if N were doubled, there would still be 16
MC-groups, but each MC-group would have twice as
many PEs. Since there would also be twice as many
MSUs (since there are N/Q MSUs), all of the memory
units within one memory unit set could still be loaded in
one parallel block transfer. Hence, the results of the
simulations would not be affected.

On the other hand, if the value of Q were doubled,
there would be 32 MC-groups and only 32 MSUs. This
change enables the system to execute tasks which require 
32 PEs, in addition to the other possible task sizes. 
Since there are half as many MSUs, it would take twice 
as many parallel block transfers to load the PCU 
memory units for a task. Hence, doubling the value of Q 
would have a similar effect to doubling the time to 
load/unload a data block for an MC-group, as done in 
Experiment 2. 

“In computer systems, the arrival of individuals at a
card reader, the failure of circuits in a central processor, 
and requests from terminals in a time-sharing system are 
processes that are essentially Poisson in nature.” [5] 
Since PASM serves requests from terminals (as does a 
time-sharing system), task arrivals are generated with a 
Poisson process. The mean task interarrival time was 
selected to be 20 simulation seconds. A uniform distri- 
bution is used for determining the number of MC-groups 
a task requires. Each simulation run was for 20,000
“PASM seconds” and had approximately 1000 tasks
executed. The performance analysis has been divided into
two experiments. 
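
A workload of this kind could be generated as in the sketch below. It is not the PASM Operating System simulator: the function name, the dictionary layout, and the seed are hypothetical, the task size is assumed to be uniform over the q + 1 allowable sizes 1, 2, ..., 2^q, and the exponential execution time and the default mean of 25 seconds follow the experiment descriptions.

```python
import random


def generate_tasks(n_tasks, q=4, mean_interarrival=20.0,
                   mean_execution=25.0, seed=0):
    """Poisson arrivals, uniformly chosen sizes, exponential execution times."""
    rng = random.Random(seed)
    sizes = [1 << k for k in range(q + 1)]  # 1, 2, 4, ..., 2**q MC-groups
    now = 0.0
    tasks = []
    for _ in range(n_tasks):
        # Exponential interarrival times yield a Poisson arrival process.
        now += rng.expovariate(1.0 / mean_interarrival)
        tasks.append({
            "arrival": now,
            "mc_groups": rng.choice(sizes),
            "execution": rng.expovariate(1.0 / mean_execution),
        })
    return tasks
```

With a mean interarrival time of 20 seconds, about 1000 arrivals are expected in a run of 20,000 simulated seconds, in line with the figure quoted above.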

Experiment 1. In this experiment the distribution 
for the task execution time was chosen to be exponen- 
tial. The mean execution time was varied from five to 
50 simulation seconds. The time to load/unload a data 
block for an MC-group has been selected to be 0.090 
simulation seconds. This load time is based on the time 
to load 64 kilobytes of data into a memory unit assum- 
ing that each MSU was a CDC BK7XX Storage Drive 
Module (disk) [2]. This time accounts for the seek and 
latency times of the disk which can be overlapped with 
the time to set the Memory Storage System busses. 
However, it does not account for any overhead from file 
system actions which should be insignificant when com- 
pared to the seek and latency times. 

Tab. 1. Average response time (in simulation seconds) is
given for the four variations in control strategy
as a function of the average task execution time
(in simulation seconds).

Tab. 2. MC utilization is given for the four variations
in control strategy as a function of the average
task execution time (in simulation seconds).

The average response time is given for the four
variations in control strategy as a function of the average
task execution time in Tab. 1. In [19] it has been
determined both analytically and by simulation that the
average number of tasks being executed by the system for a
uniform distribution of task sizes is 2.58 times the MC
utilization. Hence, if the system is to be 100 percent
utilized, the mean task execution time must be at least
51.6 seconds (if the mean interarrival time is 20 seconds).
Therefore, when the average execution time is small (less
than 20 seconds), the system (or MC) utilization is low
(less than 40 percent, see Tab. 2). With the utilization
so low, there are usually free MC-groups and tasks can
normally be scheduled immediately upon arrival. If
tasks are scheduled immediately upon arrival, there is no
time period in which tasks can be preloaded and as a
result no preloading of input data occurs. Hence, for
small average execution times (less than 20 seconds), the
response time for all variations in control strategy is
about the same.
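
The constants 2.58 and 51.6 can be reproduced with a short calculation. It assumes that task sizes are uniform over 1, 2, 4, 8, and 16 MC-groups, an assumption which is consistent with the quoted numbers but not spelled out here, and that in steady state the average number of executing tasks equals the arrival rate times the mean execution time. The function names are hypothetical.

```python
def mean_task_size(q):
    """Mean number of MC-groups per task for sizes uniform over 1..2**q."""
    sizes = [1 << k for k in range(q + 1)]
    return sum(sizes) / len(sizes)


def tasks_per_unit_utilization(q):
    """Average number of executing tasks divided by the MC utilization."""
    return (1 << q) / mean_task_size(q)


def saturating_execution_time(q, mean_interarrival):
    """Mean execution time at which the offered load reaches 100 percent."""
    return mean_interarrival * tasks_per_unit_utilization(q)
```

For Q = 16 the mean task size is 6.2 MC-groups, so 16 / 6.2 ≈ 2.58, and 20 × 2.58 ≈ 51.6 seconds. By the same reasoning a mean execution time of 20 seconds corresponds to a utilization of about 20 / 51.6 ≈ 0.39, in agreement with the "less than 40 percent" figure.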

The prescheduling scheme alters the order in which
tasks are scheduled from the normal scheme since the
tasks are scheduled in advance. The use of
prescheduling sometimes results in bad scheduling
decisions (or mis-scheduling). For example, consider a
PASM with two MC-groups (Q = 2) which is executing
tasks α and β, each requiring one MC-group (see Fig.
8a). Task γ has been prescheduled to follow α. Task β
completes execution before task α. Now MC-group 1 is
idle and task γ is waiting to be executed on MC-group 0.
Hence, task γ has been mis-scheduled, resulting in
increased response time for task γ. These bad decisions
will have little effect when the average task execution
time is small. However, when the average task execution
time becomes large (i.e., greater than 25 seconds), the
bad decisions have greater effect. This effect is
illustrated by the average response time for the
prescheduling scheme becoming greater than the average response
time for the prediction scheme for large execution times
(see Tab. 1).

The MC utilization for the four system variations in 
control strategy is given in Tab. 2. For execution times 
of less than 40 seconds, the system is able to service all 
of the arriving tasks (task arrival rate equals 
throughput) under each variation in control strategy. As 
a result, for small execution times the MC utilization is 
the same for all control strategies since the same set of 
tasks is being executed. As the average execution time 
increases, the throughput of tasks requiring 16 MC- 
groups decreases for the single and double variations. 
Since the system is executing fewer 16 MC-group tasks, 
the MC utilization is lower for the variations without 
preloading. When the average execution time is 50
seconds, the 16 MC-group task throughput is less for
prescheduling than prediction, resulting in the difference
in MC utilization. This occurs since the prescheduling
scheme can only preschedule tasks which require the
same number or fewer MC-groups than a currently
running task. Therefore, a 16 MC-group task can only be
prescheduled if there is a 16 MC-group task running.
Hence, the prescheduling scheme tends to favor tasks
which do not require 16 MC-groups. As the demand on
the system increases (e.g., longer average task execution
times), the MC utilization becomes limited with the
single and double variations since the processors cannot be
utilized while they are waiting for data and programs to
be loaded and unloaded. Hence, the maximum allowable
system load (utilization) is higher when the preloading
schemes are used.

In Tab. 3 the average task load delay time is given
as a function of the average task execution time for the
four variations in control strategy. This is given to show
how the preloading schemes reduce the load delay times
for tasks. Consider the single-buffered variation. As the
average task execution time increases, the system
utilization approaches one (see Tab. 2). When a task is
scheduled it is usually being executed by MC-groups
which have just completed executing another task. As a
result, the new task must wait for the output data from
the previous task to be unloaded and its input data to be
loaded. Hence, the load delay time increases with
increased utilization for the single-buffered variation.
However, with the prediction and prescheduling schemes,
longer task execution times allow more time for task
preloading. Therefore, for large task execution times,
the average load delay time approaches zero for both
preloading schemes (see Tab. 3). The average load delay
time will never reach zero since there are constraints on
when preloading is possible (e.g., cannot preload tasks
which require more MCs than any given currently
executing task) which will always prevent some tasks from
being preloaded. The average load delay time for the
prediction scheme does not approach zero as rapidly as it
does for the prescheduling scheme since some tasks are
not executed by the virtual machine in which they were
preloaded (e.g., task δ of the example in Fig. 6).

Tab. 3. Average load delay time (in simulation seconds)
is given for the four variations in control
strategy as a function of the average task execution
time (in simulation seconds).


In summary, for small execution times (less than 20
seconds) the system performs the same for all variations
in control strategy. For large execution times (greater
than 20 seconds) the prediction scheme performs best.
For a given task, the load delay time is a component of
the response time (see Fig. 9). As a result, load delay
time does not indicate the direct effect on the user, as
does the average response time. Hence, the lower
average response times (for larger execution times) provided
by the prediction scheme are more significant than the
lower average load delay times provided by the
prescheduling scheme. Therefore, this experiment
indicates that the prediction scheme is the method of choice.

Tab. 4. Average response time (in simulation seconds) is
given for the four variations in control strategy
as a function of the time to load/unload one
data block (in simulation seconds).

Experiment 2. In this experiment the distribution 
for the task execution time is exponential with mean 
task execution time of 25 simulation seconds. The time 
to load/unload a data block for an MC-group is varied 
from 0 to 0.315 simulation seconds. Varying the time to 
load/unload a data block could result from varying the 
size of the data block (which would result from varying 
the size of the PCU memory units) or from changing the 
type or speed of the secondary storage device used by 
the MSUs. For example, the time to load 64 kilobytes of
data from a disk which employs "Winchester" technology
can range from 0.2 to 0.4 seconds, depending on the
particular manufacturer (e.g., for the Hewlett-Packard
Model 7910, the average load time is 0.236 seconds [6]).

The average response time is given for the four variations
in control strategy as a function of the time to
load/unload a data block in Tab. 4. Note that in Tab. 1 
the average response time was given as a function of the 
average task execution time, while in Tab. 4 it is given 
as a function of the time to load/unload a data block. If 
the load/unload time is zero, the average response time 
for the single, double, and prediction variations is the 
same since loading and unloading a task requires no 
time. The response time for the prescheduling variation 
is greater than the other variations when the 
load/unload time is zero since with the prescheduling 
variation, the system is still prescheduling tasks, resulting
in some miss-scheduling. Hence, the zero
load/unload time case directly illustrates the increase in 
response time resulting from the use of prescheduling. 

As the load/unload time increases, the average 
response times for the single-buffered variation increase 
at a greater rate since the load and unload time for a 
task must be added to the execution time of every task. 
For all load/unload times, the prediction scheme yields 
the lowest response times. For load/unload times of 
greater than 0.045 simulation seconds, the prescheduling 
scheme yields lower average response times than the 
single-buffered variation, and for load/unload times of 
greater than 0.180 the prescheduling scheme yields lower 
average response times than the double-buffered 
variation without preloading. These cross-overs in the 
average response time occur since the benefit of the 
preloading (from the use of prescheduling) becomes more 
significant with greater load/unload times (and over- 
comes the increase resulting from miss-scheduling). 


VI. Conclusion 


Two schemes which can be used with the FFMQ 
scheduling algorithm for preloading input data into the 
PCU memory modules have been presented. The two 
schemes (prescheduling and prediction) make use of the 
double-buffered PCU memory modules. Since both 
schemes have advantages and disadvantages, in order to 
evaluate and quantify their relative performance, it was 
necessary to conduct simulation studies. The perfor- 
mance of the system has been evaluated with four varia- 
tions in control strategy. It has been shown that the use 
of the double-buffered memory modules for overlapping 
the unloading of the output data from the previous task 
with the execution of the next task results in a 
significant decrease in average response time. Further- 
more, it has been shown that the average response time 
can be decreased more significantly by using the double- 
buffered memories for input data preloading (along with 
overlapped unloading). When the system becomes 
heavily loaded, the system performs better with the 
prediction scheme than with the prescheduling scheme 
since the prescheduling scheme alters the natural order- 
ing of the tasks which results from using the scheduling 
algorithm. However, the prescheduling scheme has the 
advantage that it does not do any unnecessary loading of 
input data which may not be used. The prediction 
scheme also has the advantage that in the worst case the 
resulting system performance will never be worse than 
that of the overlapped unloading case since the same 
scheduling order is maintained and all preloading is done 
with lower priority. This claim cannot be made for the 
prescheduling scheme since it alters the scheduling order. 

In summary, the ‘‘prediction” preloading scheme 
makes good use of the Memory Storage System architec- 
ture and the double-buffered PCU memory modules. It 
overcomes the problem of how to determine where the 
system can preload tasks prior to final processor selec- 
tion. Thus, the double-buffered primary memory - paral- 
lel secondary storage device organization can be 
exploited for overlapped loading of tasks as well as over- 
lapped unloading. The preloading schemes can use any 
scheduling algorithm and can be adapted for use in other 
multiple-SIMD and partitionable SIMD/MIMD systems. 

CONSTRUCTING A PARALLEL IMPLEMENTATION FROM HIGH-LEVEL
SPECIFICATIONS: A CASE STUDY USING RESOURCE EXPRESSIONS 
Chapel Hill, NC 27514


Abstract -- Resource expressions are high-level
specifications of the coordination of concurrent
requests to access a shared resource. This paper
presents an operational semantics for resource
expressions and shows how it is used to systematically
construct an implementation for resource expressions.
The operational semantics defines the necessary
conditions to be tested and the actions to be taken
when a condition holds. To simplify the definition of their
semantics, resource expressions are represented in an
intermediate form consisting of a set of condition-action
pairs. The implementation of resource
expressions represented in this intermediate form is a
parallel program constructed from a set of queueing
primitives and primitives for arbitration and parallel
execution. This implementation is presented here by
showing informally how the semantics of conditions and
actions are realized by the primitives.


1. Introduction


The protection and sharing of resources are 
central to the construction of parallel programs. A 
resource is assumed here to be any data object 
together with a set of coordinated operations on this 
object. Coordination of these operations is necessary 
in order to maintain the consistency of the data object, 
or to optimize the amount of parallel execution of 
these operations. Resource expressions are a high-level
language extension for specifying constraints,
such as mutual exclusion, priority of operations, etc.
Although their initial development was in the context of
a functional language [9], they can be used in
conventional languages as well. They are closely related
to path expressions [3] in their basic approach to
specification, but there are important semantic
differences between the two languages. We use the
term resource expression, rather than path
expression, in recognition of the differences in their
semantics. The reader is referred to [9] for a broader
overview of our approach.


Resource expressions are basically regular 
expressions [10] extended with conditions and 
constructs for concurrent operations. For example, 
the resource expression 


((write)* + [read])* 


specifies a simple version of the readers-and-writers
constraint of [5]. The operators "*" and "[]" specify
zero or more sequential and parallel iterations of their
respective bodies. Thus the subexpression (write)*
specifies sequential execution of write requests, and
the subexpression [read] specifies parallel execution of
read requests. The operator "+" denotes selection of
one of its operands, hence all read requests execute in
mutual exclusion of all write requests. Since the
selection of one of the operands of "+", and the actual
number of iterations performed by "*" and "[]", can be
time-dependent, these operators are nondeterministic
operators.


To exercise greater control over the selection of 
alternatives, we specify conditions that must hold 
before any alternative is selected. The conditions used 
are often thresholds on the number of requests 
present for any operation. We use $x to refer to the 
number of requests present for operation x. For 
example, the resource expression 


( (write)* + ($write=0 and $read>0) [read] )* 


specifies priority for write requests, because [read] 
can be selected only when there are no write requests 
present. However, once a read request is being 
executed, further read requests may be executed even 
if a write request should then become present. 


The expression 
( (write)* + [($write=0 and $read>0)read] )* 


on the other hand, specifies stronger priority for write 
requests than the previous expression, because the 
absence of a write request is tested before every read 
request is executed. 
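
The role of such threshold conditions can be sketched in a few lines of Python. This is only an illustration under assumed names: the function `holds`, the dictionary `pending` (mapping an operation x to $x), and the encoding of a relation as a tuple `(op, ">", n)` or `(op, "=0")` are not taken from the paper.

```python
def holds(condition, pending):
    """Return True if every relation of the condition is satisfied.

    condition: iterable of relations, each (op, ">", n) or (op, "=0")
    pending:   dict mapping an operation name to its number of waiting
               requests, i.e. the value written $op in the text
    """
    for relation in condition:
        op, kind = relation[0], relation[1]
        count = pending.get(op, 0)
        if kind == "=0":
            if count != 0:
                return False
        elif kind == ">":
            if count <= relation[2]:
                return False
        else:
            raise ValueError(f"unknown relation kind: {kind!r}")
    return True


# The guard ($write=0 and $read>0) used in both expressions above.
read_guard = (("write", "=0"), ("read", ">", 0))
```

In the first expression the guard would be tested once, before the parallel iteration [read] is entered; in the second it would be tested before every individual read request, which is what makes the write priority stronger.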


Another basic primitive is sequencing, which is
denoted by ".". For example, (f.g)* specifies that
requests for operations f and g are to be executed in
strict alternation. The expression f.g + h.k specifies a
selection of only one of these two sequences, f.g or h.k,
depending upon which sequence is ready to be selected
first. There are three possible criteria for the selection
of f.g (and similarly h.k):

1. Without testing for a request for either f or g;

2. As soon as a request for f is present;

3. Only when requests for both f and g are present.


We adopt the first criterion here, since the other 
two can be specified by means of explicit conditions 
before a sequence. For example, the second criterion 
can be stated as 


($f>0)f.g 
and the third criterion by 
($f>0 and $g>0)f.g 


This reflects the view that the conditions upon which 
actions are taken in a resource controller should be 
explicitly stated by the programmer. Although the 
specifications tend to become longer, the resulting 
programs become more self-documenting. 


The syntax of resource expressions, for the 
purpose of discussion in this paper, is given by the 
following grammar. We use {} to denote zero or more 
repetitions of the enclosed rule. 


    RE    --> RF {+ RF}
    RF    --> RG | (COND) RG
    RG    --> RH {. RH}
    RH    --> op | (RE)* | [RE2]
    RE2   --> RF2 {+ RF2}
    RF2   --> RG2 | (COND) RG2
    RG2   --> op
    COND  --> REL {and REL}
    REL   --> ($op=0) | ($op > NATNO)
    NATNO --> 0 | 1 | 2 | ...


where op is the name of an access operation. 


We have omitted discussion of the operators "{}" and
"#" of [9] in the presentation here due to shortage of
space.


2. Operational Semantics 


In order to simplify the presentation of the 
definition of its semantics, a resource expression is 
converted into an intermediate form, which consists of 
a set of condition-action pairs, similar to Dijkstra’s 
guarded commands [6]. The general form of these 
condition-action pairs is 


    ((c_1 a_1)(c_2 a_2) ... (c_n a_n)),

where

    (c_i a_i) is a condition-action pair, one for each term
        separated by "+", where
    c_i is of the form (r_1 and r_2 and ... and r_k)
    a_i is of the form (x_1 x_2 ... x_l)
    where
        r_j is either ($op_j>n_j) or ($op_j=0),
            where op_1 ... op_k are operations of the
            resource and n_j is a natural number,
        x_j is an operation or (STAR I) or (BRACKET I)
            where I is ((c_1 a_1) ... (c_m a_m)) and STAR
            and BRACKET are functions for "*" and
            "[]" respectively (explained later).

The intermediate form may be thought of as a
different syntactic form of the given resource
expression, in which conditions and actions are stated
in a more uniform, but perhaps less readable, manner.
The translation to the intermediate form is quite 
straightforward and is omitted here. For example, the 
intermediate form of the resource expression 


( ($x>0) (($x>1) x)*.x + ($y>0) [($y>0) y] )* 


is of the form

    (c a)

where

    c = true,
    a = ((STAR ((c1 a1) (c2 a2))))

where

    c1 = ($x>0), a1 = ((STAR (c11 a11)) x)
        where c11 = ($x>1), a11 = (x)

    c2 = ($y>0), a2 = ((BRACKET (c21 a21)))
        where c21 = ($y>0), a21 = (y)
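
As an illustration only, the intermediate form of this example can be written down as nested Python tuples. The encoding below is an assumption made for the sketch (it is not the representation used in the paper): a relation is `(op, ">", n)` or `(op, "=0")`, a condition is a tuple of relations with the empty tuple standing for true, and an action is a tuple whose items are operation names or `("STAR", I)` / `("BRACKET", I)`. The helper `operations` is likewise hypothetical.

```python
# The empty conjunction stands for the condition "true".
TRUE = ()

# Innermost condition-action pairs of the example.
c11, a11 = (("x", ">", 1),), ("x",)          # ($x>1) x
c21, a21 = (("y", ">", 0),), ("y",)          # ($y>0) y

# The two alternatives separated by "+".
c1, a1 = (("x", ">", 0),), (("STAR", ((c11, a11),)), "x")
c2, a2 = (("y", ">", 0),), (("BRACKET", ((c21, a21),)),)

# The whole expression: a single pair whose action is the outer "*".
form = ((TRUE, (("STAR", ((c1, a1), (c2, a2))),)),)


def operations(intermediate):
    """Collect the names of all operations occurring in an intermediate form."""
    ops = set()
    for _condition, action in intermediate:
        for item in action:
            if isinstance(item, str):
                ops.add(item)            # a plain operation name
            else:
                _tag, body = item        # ("STAR", I) or ("BRACKET", I)
                ops |= operations(body)
    return ops
```

A traversal such as `operations(form)` yields the set of distinct operations, which corresponds to the set of queues that an implementation would need for this expression.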
The operational semantics of resource expressions
expressed in the above form is defined by showing the
necessary conditions to be tested and the action to be
taken once a condition holds. We define

    eval(I, R)

which "evaluates" a resource expression I in the
environment of a set of requests R, i.e. executes
requests present in set R that satisfy the constraints of
I. It is assumed that the content of set R can change,
either because requests are added to it from outside
the resource or because requests are removed from it
in the course of evaluation of the resource expression.


We take some syntactic liberties in the language
used in defining the semantics. For example, if the
structure of the argument to a function f is (c a), we
may write the definition of f as f((c a)) = ... and use c
and a directly in the body of f. We may also test the
structure of the argument x of a function f(x) by a
predicate such as x = (c a), and then use c and a within
the scope of the predicate.


We present the semantics by first presenting the 
definitions, followed by a brief informal explanation of 
them. At the top level, we have 


    eval(I, R) = evalaction(evalcond(I,R),R)


where evalcond and evalaction specify the semantics of 
conditions and actions respectively. 


Semantics of conditions 


    evalcond(I,R) =
        IF I = ((c a)) THEN testuntil(R)(first(I)) ELSE
        insert(choose, applytoall(testuntil(R), I))

    where

    testuntil(R)((c a)) = WHEN present(c, R) THEN a


The testing of conditions is defined by evalcond
which returns the action corresponding to the
condition that is detected to be true earliest. insert(f,
I) and applytoall(f, I) are primitive operations, whose
meanings can be understood by considering the above
definition of evalcond for the case when I is of the form
((c_1 a_1)(c_2 a_2) ... (c_n a_n)), for n > 1:

    evalcond(I,R)
      = insert(choose, applytoall(testuntil(R), I))
      = choose(testuntil(R)((c_1 a_1)),
          choose(testuntil(R)((c_2 a_2)),
            ...
              choose(testuntil(R)((c_{n-1} a_{n-1})),
                     testuntil(R)((c_n a_n))) ... ))

which effectively chooses the condition c_i that is
detected to be true earliest. choose is a primitive
operation which nondeterministically selects one of
its arguments, depending upon which is evaluated
earlier. (It is possible to have a more "balanced"
testing of conditions than the one shown above using
insert, but we do not consider it here.)


testuntil takes as input the set of requests R and 
produces a new operation which takes a condition- 
action pair (c a) as input and returns action a only 
when condition c is satisfied by the input requests in R. 
It should be noted here that the input R can change
by the arrival of new requests, hence a condition c that
is not true for some set of inputs R may become true
when R has more requests.


Semantics of actions 


    evalaction(a, R) =
        FOR EACH x IN a DO evaltype(x, R)
    where
    evaltype(x, R) =
        IF operation(x) THEN execute(remove(x, R)) ELSE
        IF x = (STAR I) THEN evalstar(I, R) ELSE
        IF x = (BRACKET I) THEN evalbracket(I, R)
    where
    evalstar(I, R) =
        WHILE insert(or, applytoall(test(R), I))
        DO evalplus(I, R)

    evalbracket(I, R) =
        IF insert(or, applytoall(test(R), I))
        THEN {a <-- evalcond(I, R);
              req <-- remove(first(a), R);
              (execute(req) || evalbracket(I, R))}
        ELSE ;

    test(R)((c a)) = present(c, R)


evalaction defines sequential execution of the
terms (x_1 x_2 ... x_l) in an action a. evaltype treats the
different types of x: If x is a resource operation, a
request for x is removed from R and executed;
otherwise, x is of the form (STAR I) or (BRACKET I),
whose semantics are defined by evalstar and
evalbracket respectively. (STAR I) and (BRACKET I)
represent "*" and "[]" respectively, whose semantics
differ primarily in that "*" specifies sequential
iteration, whereas "[]" specifies parallel iteration. The
iterations of "*" and "[]" continue only as long as some
condition in their body is true. The testing of this
condition is specified in the semantic definition by the
expression

    insert(or, applytoall(test(R), I)),

where or is the primitive boolean operation and
test(R)((c a)) returns a boolean value indicating if
condition c is satisfied by the requests present in R.


The primitive operator "||" evaluates both its
arguments in parallel. It should be noted that the
testing of conditions and removal of requests across
successive iterations of "[]" take place sequentially,
whereas the execution of requests across successive
iterations of "[]" takes place in parallel. If the testing of
conditions and removal of requests were to take place
in parallel, it would be possible to detect conditions
erroneously.
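
The division of labour described above (sequential testing and removal, parallel execution) can be sketched in Python as follows. This is a simplified illustration, not the paper's definition: it handles only a body of the form [($op>0) op], replaces the recursion of evalbracket by a loop, and uses a thread pool for the parallel part. The names `eval_bracket` and `execute` are assumptions of the sketch.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def eval_bracket(queue, execute):
    """Sketch of parallel iteration for a body of the form [($op>0) op].

    The condition test and the removal of a request are done one after
    the other by the calling thread; only the execution of the removed
    requests is handed to worker threads and may overlap.
    """
    futures = []
    with ThreadPoolExecutor() as pool:
        while len(queue) > 0:                # test the condition ($op>0)
            request = queue.popleft()        # remove one request
            futures.append(pool.submit(execute, request))  # run in parallel
    # Leaving the "with" block waits for all executions to finish.
    return [future.result() for future in futures]
```

Because a single thread performs every test and removal, no two iterations can observe the same queue state and remove the same request, which is the kind of erroneous detection the text warns about.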


3. Implementation 


The operational semantics of resource expressions 
may be viewed as defining an abstract interpreter for 
the language. However, our goal is to implement 
resource expressions by generating target code in 
terms of a set of synchronization primitives. We use a 
set of queueing primitives and primitives for 
arbitration and parallel execution to synchronize and 
schedule the execution of requests. These primitives 
were first defined in [8], and are given in the appendix. 


The set of input requests R can be represented by 
a set of queues, one queue for each distinct type of 
operation. The addition of requests to R from outside 
the resource is achieved by enqueueing them to the 
appropriate queues. The removal of requests from R
during the evaluation of a resource expression is 
achieved by dequeueing them from the appropriate 
queues. 


We first show the top-level structure of the target 
program constructed for an intermediate form, and 
then show the target programs for conditions and
actions separately. In each case, we first present the
target program, followed by a brief explanation of the 
relation between the target program and the defined 
semantics. The entire construction is illustrated by an 
example. 


Top-level structure


Assuming I is an intermediate form
((c_1 a_1)(c_2 a_2) ... (c_n a_n)), then the top-level translated
program is

    LET t_1 = arbit(targetprogram(c_1), t_2)
        t_2 = arbit(targetprogram(c_2), t_3)
        ...
        t_{n-1} = arbit(targetprogram(c_{n-1}),
                        targetprogram(c_n))
    RESULT
        IF t_1 THEN targetprogram(a_1) ELSE
        IF t_2 THEN targetprogram(a_2) ELSE
        ...
        IF t_{n-1} THEN targetprogram(a_{n-1}) ELSE
        targetprogram(a_n)


where targetprogram(c_i) and targetprogram(a_i) are
defined in the next two subsections.


The abbreviations t_1, t_2, etc., are equated to
expressions that are evaluated only once. The above
program fragment realizes the semantics of evalcond.
The semantics of choose is realized by the primitive
operator arbit, which evaluates both its arguments in
parallel and returns a boolean value indicating which
argument was evaluated earlier. The chain of arbit
operations realizes the semantics of the definition of
evalcond for the case when n>1. When n=1 no
arbitration is needed, and targetprogram(c_1) is used
directly instead of t_1.
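
A minimal Python sketch of an arbiter with the behaviour described for arbit is given below. It is an assumption-laden illustration rather than the primitive of [8]: the arguments are passed as zero-argument callables, each is run in its own daemon thread, and the function returns as soon as the first of them finishes.

```python
import threading


def arbit(a1, a2):
    """Evaluate the callables a1 and a2 in parallel.

    Returns True if a1 is the first to finish and False if a2 is,
    without waiting for the slower of the two.
    """
    finished = []                    # labels in order of completion
    lock = threading.Lock()
    done = threading.Event()

    def run(thunk, label):
        thunk()
        with lock:
            finished.append(label)
        done.set()

    for thunk, label in ((a1, True), (a2, False)):
        threading.Thread(target=run, args=(thunk, label), daemon=True).start()

    done.wait()                      # block until one argument has finished
    with lock:
        return finished[0]
```

With blocking arguments such as the waits generated for conditions, the result identifies the condition that was detected first; chaining calls as in the top-level program extends this to n alternatives.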


Conditions 


Assuming c is a condition of the form
(r_1 and r_2 ... and r_k), where each r_j can be either
($op_j>n_j) or ($op_j=0), then

    targetprogram(c) = spar(w_1, w_2, ..., w_k)
    where
        w_j = waitq(qop_j, 1+n_j), if r_j is ($op_j>n_j), and
        w_j = waitq(qop_j, 0), if r_j is ($op_j=0)
    where
        qop_j is the queue for operation op_j,
        spar and waitq are defined in the appendix


The conjunction of the test for equalities or inequalities
is expressed by the primitive operator spar. The above
program fragment realizes the semantics of testuntil.
Unlike the semantic definition of testuntil,
targetprogram(c) does not return (c a), but instead
returns true when c becomes true; the action a is
selected by the top-level program.
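
The translation of a condition into waits on queues can be sketched in Python with a condition variable. The class `RequestQueue` and the function `wait_condition` are hypothetical names introduced for this sketch; they only approximate waitq and spar as summarized in the appendix and are not the primitives of [8].

```python
import threading
from collections import deque


class RequestQueue:
    """A queue of requests that threads can wait on (sketch of waitq)."""

    def __init__(self):
        self._items = deque()
        self._changed = threading.Condition()

    def enq(self, request):
        with self._changed:
            self._items.append(request)
            self._changed.notify_all()

    def deq(self):
        with self._changed:
            request = self._items.popleft()
            self._changed.notify_all()
            return request

    def waitq(self, n):
        """Wait until the queue has at least n requests; for n == 0,
        wait until the queue is empty. Returns True."""
        with self._changed:
            if n == 0:
                self._changed.wait_for(lambda: len(self._items) == 0)
            else:
                self._changed.wait_for(lambda: len(self._items) >= n)
            return True


def wait_condition(waits):
    """Run all waits in parallel and return True once every one has
    returned, in the manner of spar(w_1, ..., w_k)."""
    threads = [threading.Thread(target=w) for w in waits]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return True
```

For the condition ($write=0 and $read>0), the corresponding call would be `wait_condition([lambda: writeq.waitq(0), lambda: readq.waitq(1)])`, with `writeq` and `readq` being two `RequestQueue` objects.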


Actions 


The semantics of actions is specified by evalaction
and is realized as follows: Assuming a is an action of the
form (x_1 x_2 ... x_l), then

    targetprogram(a) = {targetprogram(x_1);
                        targetprogram(x_2);
                        ...
                        targetprogram(x_l)}
    where
    targetprogram(x_j)
      = evalq(qop_j), if x_j is an operation name op_j, and qop_j is
        the queue for operation x_j.
      = star(queues-for-I), if x_j = (STAR I)
      = bracket(queues-for-I), if x_j = (BRACKET I)
        where star and bracket are procedures that
        realize the semantics of evalstar and
        evalbracket


The above program fragment realizes the semantics of
evalaction and evaltype. Since evalaction sequentially
evaluates each x_j, targetprogram(a) is obtained by
sequencing the program fragments for each x_j. The
target program for x_j depends upon its type: If it is a
simple operation name, then a request of this
operation type is dequeued and executed by evalq;
otherwise it must be of the form (STAR I) or (BRACKET I).
Since these two cases are similar, we present only the
latter.


Parallel repetition

Assuming I = ((c_1 a_1)(c_2 a_2) ... (c_n a_n)), and
queues-for-I denotes the set of queues for the distinct
operations in I,

    bracket(queues-for-I) =
        LET t_1 = targetprogram(c_1)
            t_2 = targetprogram(c_2)
            ...
            t_{n-1} = targetprogram(c_{n-1})
            t_n = targetprogram(c_n)
        RESULT
            IF or(t_1, t_2, ..., t_n) THEN
                IF t_1 THEN targetprogram(a_1) ELSE
                IF t_2 THEN targetprogram(a_2) ELSE
                ...
                IF t_{n-1} THEN targetprogram(a_{n-1})
                ELSE targetprogram(a_n)


Assuming c_i is of the form (r_1 and r_2 and ... r_k)
where r_j is either ($op_j>n_j) or ($op_j=0), then

    targetprogram(c_i) = and(m_1, m_2, ..., m_k),
    where
        m_j = nonempty(qop_j, 1+n_j), if r_j is ($op_j>n_j), and
        m_j = nonempty(qop_j, 0), if r_j is ($op_j=0)


Since a_i must be of the form (op)

    targetprogram(a_i) =
        {res <-- deq(qop);
         spar(execute(res), bracket(queues-for-I))}


The recursive call on evalplus from evalbracket in the
semantic definition is realized above by a sequence of
IF-THEN-ELSEs. The above program tends to bias the
selection of alternatives towards t_1. This can be
avoided by replacing the IF-THEN-ELSEs by a program
fragment similar to that of the top-level structure. If
all t_i are trivially satisfied, then the condition of the
first IF statement is replaced by arbit(true, false),
which nondeterministically decides the termination of
bracket.


Example

We illustrate our implementation by showing the
synthesized program for the resource expression

    (($write>0) write + ($write=0)[($read>0) read])*

whose intermediate form is

    (true ((STAR ((($write>0) (write))
                  (($write=0) ((BRACKET (($read>0) (read)))))))))

The complete translated program is

    ‘ound(writeq, readq) =
        LET t1 = arbit(waitq(writeq,1),     /* ($write>0) */
                       waitq(writeq,0))     /* ($write=0) */
        RESULT
            WHILE true DO
                IF t1 THEN evalq(writeq)    /* write */
                ELSE bracket(readq)         /* [($read>0) read] */

    bracket(readq) =
        LET t1 = nonempty(readq,1)          /* ($read>0) */
        RESULT
            IF t1 THEN {res <-- deq(readq); /* read */
                        spar(execute(res), bracket(readq))}
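
The scheduling decisions made by this program can be traced with a small sequential Python sketch. It is a simplification under stated assumptions: the function `one_round` is hypothetical, it inspects the queues instead of blocking in waitq, it models one iteration of the outer loop, and it returns the requests that would be dispatched rather than executing them.

```python
from collections import deque


def one_round(writeq, readq):
    """One iteration of the outer loop for
    (($write>0) write + ($write=0)[($read>0) read])*.

    Returns a pair (kind, requests): a single write if any write request
    is waiting, otherwise the batch of read requests that the parallel
    iteration would dispatch.
    """
    if len(writeq) > 0:                  # ($write>0) write
        return ("write", [writeq.popleft()])
    batch = []                           # ($write=0) [($read>0) read]
    while len(readq) > 0:                # ($read>0), tested before each read
        batch.append(readq.popleft())
    return ("reads", batch)
```

Since the inner condition tests only $read, a batch of reads, once started, is not interrupted by writes; the write priority applies at the point where an alternative is selected.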


4. Conclusions and related work 


We have presented the systematic construction of 
the implementation of resource expressions starting 
from its operational semantics. From the example, it 
is easy to see that the structure of the target program 
preserves that of the original resource expression--- 
there is one procedure for each of the repetitive 
operators, "*" and "[]", and one procedure at the 
topmost level, if "*" or "[]" does not occur at the
topmost level. Conditions and actions are translated 
separately, and the resulting program fragments are 
assembled together. 


The work on the semantics and implementation of 
path expressions is related. The meaning of path 
expressions has been defined using Petri nets [11], 
denotational and axiomatic methods [2], and guarded 
commands [1]. The implementation of path 
expressions has been based on semaphores [3, 4] and 
finite state machines [7, 1]. However, to the best of
our knowledge, there has not been a systematic
derivation of an implementation from the semantics of 
path expressions. 


The semantics and implementation of resource 
expressions differ from those of path expressions. The 
semantics of our sequencing operator "." differs from 
that of path expressions, which base the condition for a 
sequence on criterion 2 described in the first section. 
Our implementation based on queues is radically 
different from the implementations for path 
expressions. The advantage of our approach is that the 
structure of the target program corresponds closely to
the structure of the original expression, hence the
correctness of the translation becomes more evident.
Our explicit use of conditions, however, gives rise to
longer specifications than equivalent path expressions. 


The conditions used in this paper are more restrictive 
than those found in [1], where conditions can test the 
state of the resource, the number of operations in 
execution, etc. 


There are some differences between the resource
expressions presented here and those of our earlier paper [9].
Here we propose the use of conditions as a more 
primitive concept than the commit operator "/", which 
restricts the selection of a sequence to be based upon 
the presence of requests for operations only in some 
prefix of it. Furthermore, it is possible to simulate "/" 
using conditions, although the resulting specifications 
tend to be longer. In this paper, we have restricted the 
forms of expressions inside "[]" to be simple 
operations, with possibly some condition before it. This 
simplifies the definition of the semantics as well as 
implementation. 


We are examining extensions to resource
expressions that will enhance their expressiveness. One
way is to allow more general forms of conditions than
merely tests on the status of the input requests.
Another is to allow the actual parameters of the
invoked operations to determine the conditions under
which they are selected. The semantics and
implementation of these features are being
investigated. Also being investigated are optimizations
that will enhance the efficiency of the translated
programs.

Appendix
The primitive operators used in this paper are 
summarized below: 


queue()         creates an empty queue.

enq(q,exp)      synchronizes the evaluation of exp using
                q by enqueueing a request for exp in q.

deq(q)          removes the first request from q without
                executing it.

execute(req)    executes a request req; the result of
                execution is returned to the enq
                operation that placed req on the queue.

evalq(q)        combines deq and execute; it is the same as
                execute(deq(q)).

waitq(q,n)      tests and waits until q has at least n
                requests and then returns true; if n=0
                then waitq waits until q becomes empty.

nonempty(q,n)   returns true if q has at least n requests,
                false otherwise; if n=0 then nonempty
                returns true if q is empty and false
                otherwise; no waiting is involved.


spar(a_1,...,a_n)
                evaluates the expressions a_1,...,a_n in
                parallel; the result is a,, but is
                returned after all a_1,...,a_n have
                been evaluated.

arbit(a_1,a_2)  evaluates a_1 and a_2 in parallel; the result
                is false if a_2 is evaluated before a_1,
                otherwise true.


QUEUEING NETWORK MODELS FOR PARALLEL PROCESSING OF TASK SYSTEMS 


Alexander Thomasian 
Performance Modeling Center 
Burroughs Corp. 


Santa Ana, CA 90704 
Abstract -- The paper deals with a procedure
to determine the mean completion time and
related performance measures for a task
system: a set of tasks with precedence
relationships in their execution sequence. The
tasks, which are processed by a multiprogrammed
multiprocessor system, are specified by their
expected total loadings on the units of the
computer system. A straightforward application
of a queueing network (QN) solver to the problem
is not possible due to variations in the state
of the system (composition of tasks in
execution). An approximate solution method is
presented for this purpose based on the concept
of hierarchical decomposition. At the higher
level, an efficient procedure generates the
Markov chain corresponding to the transitions
among the system states and computes state
probabilities and other parameters as each state
is created. At the lower level, the transition
rates among the states are computed using a QN
solver, which determines the throughput of the
computer system for each system state.
Numerical results are presented to justify the
decomposition method and are validated through
simulation. The approach is applicable to
performance evaluation of programs with internal
concurrency.


1. Introduction 


An efficient procedure is developed in this 
paper to compute the performance measures for 
computations exhibiting parallelism in both
centralized and distributed systems. The 
computation is specified by a task system (a set 
of tasks related by a deterministic precedence 
graph). Tasks are characterized by their 
expected total loadings at the devices of a 
given computer system, centralized or 
distributed. The potential for parallel 
processing occurs in numerous systems such as 
CPU:I/O overlap or inherent concurrency
exploited by multi-tasking in centralized 
systems [4,5,7], real-time distributed computer 
systems [2], query processing in distributed 
databases [1], etc. The technique presented in 
this paper can be used to determine the key 
performance measure for such systems, the mean 
completion time. 


Queueing network (QN) models have been
extensively used in performance modelling of
centralized/distributed computer systems [6].
However, such models assume that a program
consists of a single process (task) which
obtains service in a serial fashion from the
devices in the QN model. Specifically, this
implies that programs cannot hold more than one
device at a time, and as such does not provide
an accurate model for parallel processing. In
addition, closed QN models also assume a fixed
workload, that is, when a job completes
execution it is immediately replaced by an
identical job. This is not the case in task
systems, where a completed task may be replaced
by a set of tasks with different job types.


Recently, QN's were used to model programs
with internal concurrency by Heidelberger and
Trivedi [4,5]. These two papers also review
previous work in this area, which is omitted
here for the sake of brevity. In [4] a system
where jobs are subdivided into two or more
secondary tasks during their execution is
considered. No synchronization among the tasks
and the (primary) job is considered, however.
In [5] a parent task spawns two or more
concurrent tasks and has to wait until all such
tasks are completed before it can proceed. Only
very simple task systems can be handled by their
approach. We consider much more complicated
task systems than considered in [5], but allow
only one instance of the task system at a
time. The task system may correspond to the
computations in a real-time system, in which
case the task system is executed repeatedly, or
the execution of a single instance of a program
with internal concurrency.


The paper is organized as follows. In
Section 2 we first describe the task system
model required for analysis. Also described is
the computer system model used in our simulation
for validation of results. In Section 3 we show
that the model of a task system with two
concurrent tasks is nearly completely
decomposable and based upon this observation
present a decomposition method for obtaining an
approximate solution. Exact results obtained by
solving the set of linear equations describing
the system and results using the decomposition
method are then presented. In Section 4 we
derive expressions to compute the key
performance measures for a task system and give
a procedure for efficiently computing these
measures for general task systems. A task
system is solved and validated using
simulation. Lastly, we conclude the paper with
a summary.


2. Task System and Computer System Models 


In Section 2.1 we define a basic task system 
model which is required for analysis by the 
procedure in Section 4. In Section 2.2 a more 
detailed description of the task processing 
system is given for simulation purposes. 


2.1 Task System Model 

The set of tasks is to be executed on a 
computer system with K devices. Taking a QN 
modelling viewpoint only the expected total 
loadings (loadings for short) of each task at 
each device, which is the product of the mean 
number of requests a task makes to a device and 
the mean service time at that device per 
request, are required for computing the usual 
performance characteristics of the system when 
the QN has a product-form [6]. In summary, a
task system is specified by a 3-tuple
$(T, [<\cdot], [X_{ik}])$ as follows:

1. $T = \{T_1, T_2, \ldots, T_I\}$ is a set of tasks to
be executed. The initial or final tasks can
be dummy tasks with no processing
requirements (loadings equal to zero).

2. $[<\cdot]$ is a partial order defined on $T$
specifying the precedence
constraints, i.e., $T_i <\cdot T_j$ means that $T_i$
must be completed before $T_j$ can begin. Only
directed acyclic graphs (dags) are
considered.

3. $[X_{ik}]$ is an $I \times K$ matrix, such that $X_{ik}$
is the loading of task $i$ at device $k$. At
this point, only active system resources
such as the CPU and I/O devices are
considered. Passive system resources, such 
as memory requirements, can be incorporated 
easily into the model, but will be ignored 
in our discussions for the sake of brevity. 
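
The specification above maps naturally onto a small data structure. The following Python sketch is illustrative only: the class name `TaskSystem`, the helper `loading`, and the request counts (6, 5, 11, 10) are assumptions, chosen so that the loadings equal the values (120,200) and (220,400) used for the two-task example of Section 3.

```python
from dataclasses import dataclass, field


def loading(mean_requests, mean_service_time):
    """Loading = mean number of requests x mean service time per request."""
    return mean_requests * mean_service_time


@dataclass
class TaskSystem:
    """Hypothetical container for the 3-tuple (T, precedence, loadings)."""
    loadings: dict                              # task id -> K loadings (msec)
    preds: dict = field(default_factory=dict)   # task id -> set of predecessors

    def ready(self, done, active):
        """Tasks not yet started whose predecessors have all completed."""
        return {t for t in self.loadings
                if t not in done and t not in active
                and self.preds.get(t, set()) <= done}


# Two concurrent tasks on a CPU (20 msec/request) and a disk (40 msec/request).
two_task = TaskSystem(loadings={1: [loading(6, 20), loading(5, 40)],
                                2: [loading(11, 20), loading(10, 40)]})

# A small dag: task 3 may start only after tasks 1 and 2 have completed.
# The loading of task 3 is an arbitrary illustrative value.
dag = TaskSystem(loadings={1: [120, 200], 2: [220, 400], 3: [100, 0]},
                 preds={3: {1, 2}})
```

With no task completed, `two_task.ready(set(), set())` returns both tasks, whereas in `dag` task 3 becomes ready only once tasks 1 and 2 are in the completed set.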


The concise specification of a task system 
as given above is required for a procedure given 
in Section 4 to compute the performance 
parameters of interest for task systems. These 
parameters, which are defined in Section 3.2, are
the mean completion time of the task system, the 
mean initiation and completion time of each 
task, and its mean execution time. These 
parameters are of course dependent on the task 
scheduler. For the sake of simplicity, we 
consider a single-processor (CPU) system and 
assume that all tasks are executed as soon as 
the precedence constraints are satisfied. The 
procedure in Section 4 can be easily extended to 
take into account passive resources and to 
incorporate a more sophisticated task scheduling
discipline. 

2.2 Processing of Tasks in the Computer System


The computer system processes different 
combinations of tasks according to precedence 
constraints until all tasks are completed (in 
the case of a real-time system this cycle is 
then repeated). The tasks generally have 
different processing requirements at the 
computer system. In other words, for each task 
combination in progress at the computer system, 
we have a closed QN with multiple job types. To 
simplify discussion, the QN model postulated by 
us has a product-form solution. Under this 
assumption only the task loadings need be 
specified to compute the usual performance 
measures using efficient algorithms such as
convolution or mean-value analysis [6].
Postulating a non-product form QN would require
the use of approximate solution methods to solve 
the closed QN for each task combination [6]. 
Regardless of which method is used for solution 
of the closed QN, the decomposition procedure 
presented in this paper would remain valid. 
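
As an illustration of how throughputs can be obtained from the loadings alone, the following Python sketch applies exact mean-value analysis to a closed product-form QN with one task per job type. It is a minimal sketch under stated assumptions, not the implementation used in the paper: all devices are treated as queueing (PS or FCFS with class-independent service times) and the function name `mva_throughputs` is hypothetical.

```python
from itertools import product


def mva_throughputs(loadings):
    """Exact MVA for a closed product-form QN with one task per job type.

    loadings[r][k] is the total loading of task r at device k.
    Returns X[r], the completion rate of task r (tasks per time unit)
    when all the given tasks execute concurrently.
    """
    R = len(loadings)
    K = len(loadings[0])
    queue = {tuple([0] * R): [0.0] * K}   # mean queue lengths per population
    X = [0.0] * R
    # Visit population vectors in order of increasing total population,
    # so the full population (all ones) is processed last.
    for n in sorted(product((0, 1), repeat=R), key=sum):
        if sum(n) == 0:
            continue
        q = [0.0] * K
        for r in range(R):
            if n[r] == 0:
                continue
            # Population with task r removed (arrival theorem).
            less = tuple(n[j] - (j == r) for j in range(R))
            resid = [loadings[r][k] * (1.0 + queue[less][k]) for k in range(K)]
            X[r] = 1.0 / sum(resid)       # exactly one task of type r
            for k in range(K):
                q[k] += X[r] * resid[k]
        queue[n] = q
    return X
```

For a single task with loadings (120,200) msec the sketch returns 1/320 tasks per msec, i.e. 3.125 tasks/sec, which agrees with the value listed for task 1 executing alone in Appendix I.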


The computer system model used is a 
generalization of the well known central-server 
model to multiple job types, called a 
centralized server model. Shown in Figure 1
is the centralized model with two job types 
(with one task in each job type) consisting of a 
CPU and a disk. The queueing discipline at the 
CPU is Processor Sharing (PS) and the disk has a 
FCFS discipline. The previous assumptions 
assure a product-form QN. Unlike conventional 
closed QN's, a completed task is not immediately 
replaced by a new task with similar
characteristics, but the precedence graph is 
checked to determine if any new task can be 
activated under precedence constraints. 


3. Solving Concurrent Task Systems 
Using a Decomposition Approach 


In this section we illustrate the
decomposition method by applying it to
concurrent task systems, each of which is defined
as a set of tasks which can be executed concurrently,
such as the two-task system shown in Figure 2.
The tasks are executed on the multiprogrammed
computer system given in Figure 1. The
subscripts c and d are used to denote the CPU
and the disk, respectively. Loadings for tasks 1
and 2 are given as (120,200) and (220,400),
respectively. The time units are in
milliseconds. Both tasks have the same mean
service times at the CPU ($1/\mu_c$ = 20 msec.) and
the disk ($1/\mu_d$ = 40 msec.).


In Section 3.1 we justify the use of the 
decomposition method by building a Markov chain 
to solve the above task system exactly for its 
state probabilities, and also show that it is 
nearly completely decomposable [3]. In Section 
3.2 we present the decomposition approximation 
method, use it to solve the same task system, 
and compare decomposition results to the exact 
solution. 


3.1 Markov Chain Model 


In this section we build a Markov chain for 
the solution of the two-task system defined in 
Figure 2 and executing on the computer system in 
Figure 1. In order to reduce the number of 
states in the Markov chain we assume that the 
queueing discipline at the disk is also PS. 
Therefore we do not need to specify the ordering 
of the tasks at the disk, which would be 
required in the case of a FCFS discipline. This 
is possible since the steady state probabilities 
obtained under the PS assumption at the disks 
will equal those obtained with the FCFS 
discipline (after appropriate state aggregation 
at the disks in the FCFS case). The state of 
the closed QN can be specified by the 
composition of jobs in execution. The number of 
states for the detailed state representation 
when a subset J of tasks is active is given by

$\prod_{j \in J} \binom{n_j + K - 1}{K - 1}$ [9],

where $n_j$ is the number of tasks of job type $j$.
Needless to say, the
number of states in the system increases rapidly
with the number of concurrent tasks.


Figure 3 gives the transition rate matrix Q
for the system at hand. Denoting the steady 
state probability vector by p we have pQ = 0. 
In addition, we have the normalizing condition 
that state probabilities sum to one. After 
computing p by solving the linear equations, we 
can obtain the performance measures of the task 
system. 
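
The linear system pQ = 0 together with the normalizing condition can be solved directly for small chains. The following Python sketch is an illustration only; the function name `steady_state` is hypothetical, and plain Gauss-Jordan elimination is used here for brevity rather than because the paper prescribes it.

```python
def steady_state(Q):
    """Solve p Q = 0 with sum(p) = 1 for a small transition rate matrix Q."""
    n = len(Q)
    # Transpose Q so the unknowns form a column vector, then replace the
    # last (redundant) balance equation by the normalizing condition.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col and A[r][col] != 0.0:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return [b[i] / A[i][i] for i in range(n)]


# Two-state chain with rates 2 (state 0 -> 1) and 3 (state 1 -> 0).
p = steady_state([[-2.0, 2.0], [3.0, -3.0]])
```

For the two-state chain above the balance equation 2 p0 = 3 p1 gives p = (0.6, 0.4).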


At this point it is interesting to note that 
the transition rate matrix Q can be decomposed 
into three sub-matrices, indicated in Figure 3 


by dashed lines. For very small $p_{c1c1}$ and
$p_{c2c2}$, the six transition rate terms lying
outside the dashed areas of the Q matrix are
negligible, in which case we can say that the
system is nearly completely decomposable [3].
Informally, it can be said that the system is
decomposable when it reaches equilibrium between
task completion instants. The latter is true
when the occurrence of micro-events in the
system (completions of tasks at devices) is much
more frequent than macro-events (job
completions). This is true in our system. The
three groups of states into which the Q matrix
is decomposed correspond to the following
aggregated states: (i) (1,2) - task 1 and task
2 concurrently executing, (ii) (1) - task 1 only
executing, and (iii) (2) - task 2 only
executing. We use this observation as the
motivation for solving parallel processing task
systems using a decomposition method. Such an
approach is particularly useful for systems with
a large number of states, in which case it is
not computationally feasible to solve the system
exactly.


Figure 1. Detailed Centralized-Server
Model of a Two-Task System.


Figure 2. Task System with Two Concurrently
Processing Tasks.


[Figure 3: 8 x 8 transition rate matrix Q over the detailed states
(1,1;0,0), (0,1;1,0), (1,0;0,1), (0,0;1,1), (1,0;0,0), (0,0;1,0),
(0,1;0,0), (0,0;0,1), partitioned by dashed lines into three diagonal
blocks, one per aggregate state.]

Figure 3. Transition Rate Matrix Q for Two-Task System.


3.2 Decomposition Method 

Using decomposition, an approximate but 
highly accurate solution to our model can be 
obtained. The hierarchical decomposition method 
applicable in this case uses two modelling 
levels. In the higher level model the system
state is specified by the combination of tasks
in execution (aggregation of states of Section
3.1). The transitions among these states are
governed by a Markov chain model. The
transition rates among the states of the Markov
chain are determined by the mean throughputs of
the computer system in processing various task
combinations and are computed at the lower level
of the model. This computation can be carried
out efficiently, since the system was assumed to
have a product-form solution for each execution
state (see Section 2). Otherwise, the
throughputs could be computed by solving the set
of linear equations specified by each sub-matrix
corresponding to each aggregate state.
Alternately, an approximate solution method 
could be used at this point to obtain the system 
throughput. 


Figure 4 is the Markov chain representing
the transitions among the aggregate states of
the task system in Figure 2 (higher level
model). Both tasks initially execute
concurrently in state (1,2) until task 1 or 2 is
completed, resulting in a transition to state
(2) or (1), respectively. Task completions in
each of these states lead to a transition to
state (1,2), which indicates that the execution
of the task system is complete, that is, another
instance of the task system can be initiated.
The states in the Markov chain are drawn at
different levels, where the levels indicate the
progression of the computation of the task
system. Level one corresponds to the initial
execution state, level two corresponds to the
states which are feasible after a task is
completed at the first level, and so on. The
number of levels is equal to the number of
tasks.


The state probabilities P(1) and P(2) can be
expressed as a function of P(1,2) by solving the
corresponding balance equations,

$P(1) = P(1,2) \times T_2(1,2) / T_1(1)$   (3.1)

$P(2) = P(1,2) \times T_1(1,2) / T_2(2)$

Setting P(1,2) = 1, we can obtain numerical
values for P(1) and P(2). A normalization
constant NC = P(1,2) + P(1) + P(2) is then used
to ensure that the probabilities sum to one.
This simple scheme to compute the steady state
probabilities is of importance when dealing with
large task systems.


We define the mean completion (or cycle)
time of the task system (C) as the average
amount of time required to execute the task
system on the given computer system. The mean
execution time ($E_i$) of $T_i$ is the average amount
of time $T_i$ is in execution. Also of interest are
the mean initiation time ($I_i$) and completion
time ($C_i$) for task $T_i$. Note that $E_i = C_i - I_i$.


[Figure 4: state (1,2) at level 1; states (1) and (2) at level 2.]

Figure 4. Markov Chain for Decomposition Method.


Table 1. Comparison of Exact and Decomposition Method Results
for Two-Task System Example. $T_i(\cdot)$ in tasks/sec.
Rows: P(1,2), P(1), P(2), $T_1(1,2)$, $T_2(1,2)$, $T_1(1)$, $T_2(2)$.


The average rate at which the task system
can be executed is given by the rate at which
the initial state of the Markov chain (for the
higher level model) is exited:

$P(1,2) \times [T_1(1,2) + T_2(1,2)]$,

or the rate of exiting states (1) and (2):

$P(1) \times T_1(1) + P(2) \times T_2(2)$

for our example. It can be shown that the
values of these two expressions are equal. The
mean system completion time is the inverse of
this rate,

$C = [P(1,2) \times [T_1(1,2) + T_2(1,2)]]^{-1}$   (3.2)

Alternately, if we let $M(1,2) = [T_1(1,2) +
T_2(1,2)]^{-1}$, where M(1,2) is the mean time spent
in state (1,2), then C can be expressed as,

$C = M(1,2) / P(1,2)$   (3.3)

The mean execution time of each task is the
fraction of time a task is active in a cycle
times the cycle time. It follows,

$E_1 = C \times [P(1,2) + P(1)]$   (3.4)

$E_2 = C \times [P(1,2) + P(2)]$

Finally, it should be noted that the
branching probability that task i completes
first in execution state (1,2), $b_i(1,2)$, is
given as the ratio of the throughput resulting
in the completion of the task and the sum of the
throughputs [8],

$b_i(1,2) = T_i(1,2) / [T_1(1,2) + T_2(1,2)], \quad i = 1, 2.$   (3.5)

The expressions obtained above are very 
useful in that they can be easily generalized to 
task systems with arbitrary precedence 
relationships. 
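
Equations (3.1) through (3.5) can be evaluated in a few lines once the four throughputs are known. The following Python sketch is illustrative: the function name `solve_two_task` is hypothetical, and the throughput values in the usage line are made-up round numbers, not the entries of Table 1.

```python
def solve_two_task(T1_12, T2_12, T1_1, T2_2):
    """Higher-level model of the two-task system, equations (3.1)-(3.5).

    T1_12, T2_12: throughputs of tasks 1 and 2 in state (1,2)
    T1_1, T2_2:   throughputs of task 1 in state (1) and task 2 in state (2)
    """
    # (3.1): unnormalized probabilities with P(1,2) set to one.
    P12 = 1.0
    P1 = P12 * T2_12 / T1_1
    P2 = P12 * T1_12 / T2_2
    NC = P12 + P1 + P2                 # normalization constant
    P12, P1, P2 = P12 / NC, P1 / NC, P2 / NC
    M12 = 1.0 / (T1_12 + T2_12)        # mean time spent in state (1,2)
    C = M12 / P12                      # (3.3)
    E1 = C * (P12 + P1)                # (3.4)
    E2 = C * (P12 + P2)
    b1 = T1_12 / (T1_12 + T2_12)       # (3.5)
    return {"P12": P12, "P1": P1, "P2": P2,
            "C": C, "E1": E1, "E2": E2, "b1": b1, "b2": 1.0 - b1}


# Illustrative throughputs in tasks/sec (assumed values).
result = solve_two_task(2.0, 1.0, 3.125, 1.6)
```

As a consistency check, $E_1$ can also be written as M(1,2) plus $b_2(1,2)$ times the mean time spent in state (1); for the illustrative values this gives 1/3 + (1/3)(0.32) = 0.44, the same value the sketch returns for `E1`.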
3.3 Validation Results 

Table 1 shows a comparison of values
obtained for state probabilities, mean
throughputs, mean completion time and mean
execution time using the decomposition method
and the exact method. In order to make the
comparison meaningful, a careful interpretation
of the probability vector p and the transition
rate matrix Q of the Markov chain model must be
made such that results equivalent to the
aggregated state model of the decomposition
approach are obtained. Computations for
obtaining the equivalent aggregated state values
(in Table 1) from the Markov chain model are
given in Appendix I.


Table 1 shows that results obtained using
the decomposition method are in close agreement
with exact values. It is worthwhile to mention
that the mean completion time for the task
system is different from the mean completion
time for the larger of the two tasks.


4. Procedure for General Task Systems


In Section 4.1 we present an overview of our
procedure for solving general task systems using
the decomposition approach and in Section 4.2 we
give a formal statement of our procedure. In
Section 4.3 we solve an example and compare
decomposition results against simulation
results.


4.1 Procedure Overview 

We are interested in obtaining the mean 
completion (cycle) time of a task system, as 
shown in Figure 5, as well as the mean 


initiation time, completion time, and execution 
time for each task in the system. 


One of the key problems in dealing with task 
systems is the large number of states in the 
Markov chain at the higher level of the 
decomposition model. The number of states 
depends on the number of tasks and the
precedence relationships among them. It is 
clear that the cost of solving a large set of 
linear equations (pQ = 0) can be significant for 


larger task systems. 


Fortunately, an efficient scheme is
available due to the fact that the task systems 
which we are interested in are directed acyclic 
graphs. As such, the corresponding Markov chain 
is also acyclic (with the exception of 
transitions to the initial state). Based on the 
observation made regarding (3.1) in Section 3.2, 
we can therefore express all probabilities as a 
function of the initial probability. 
Furthermore, such unnormalized probabilities can 
be used in computing the required performance 
parameters (such parameters will have to be
readjusted by a normalization constant). Finally,
to save memory space, the Markov chain can be 
generated one level at a time. Once an entry 
corresponding to a state at one level has been 
used to generate the entries at the following 


level, it can be deleted to save space. The 
number of states at each level is usually much 
smaller compared to N (the total number of 
states). 


In the case of concurrent task systems, all
tasks are initiated at time zero and the mean
execution time for task i ($E_i$) equals the mean
completion time for that task ($C_i$).

For a general task system, the mean
execution time for task i is given by

$E_i = C_i - I_i, \quad 1 \le i \le I$   (4.1)

where $I_i$ is the mean initiation time for task
i. $I_i$ is computed by considering all possible
paths (in the Markov chain) which lead to the
initiation of the task. Given that there are K
paths leading to the initiation of task i we
have,

$I_i = \sum_{k=1}^{K} \Big[ \prod_{j \in L_k} b_j \Big] \Big[ \sum_{S \in S_k} M(S) \Big] b_k$   (4.2)

In the above expression, $L_k$ is the set of
links (branches) along path k leading to the
state which immediately precedes the initiating
state of task i. $S_k$ is the set of states along
path k and $b_k$ is the branching probability from
the last state in path k to the state in which
task i is initiated. This computation can also
be carried out a level at a time by updating a
vector of entries for task initiation times.
The details appear in the procedure presented in
Section 4.2.


Individual task completion times and more 
importantly the completion time of the task 
system can be computed using arguments similar 
to those used in deriving mean initiation 
times. We use this latter approach (rather than 
an approach based on equation (3.2)) to compute 
the mean system completion time. 


We point out that the computational cost of
solving the product-form QN models to obtain
throughputs is relatively small, since there is
only one task in each job type. The
computational cost to compute the normalization
constant for the convolution algorithm when
there are N tasks in a computer system
consisting of K devices is $2^{N+1} K N$ [9].
Additional savings in computational cost can be
achieved by making note of state dominance,
i.e., the successor states of a state (for one
or more levels) have a subset of the tasks of
the parent state in their composition (unless a
precedence relationship is satisfied and new
tasks are initiated). The throughputs of these
successor states can be obtained as a by-product
of the computations of the parent state.


4.2 Statement of Procedure 


The notation and some of the formulas used
in the procedure are as follows:

$L$ = number of levels in Markov chain
    = number of tasks in task system ($I$)
$S_\ell$ = set of states at level $\ell$
$S$ = current state (state under consideration)
$|S|$ = set of tasks in state $S$
$R$ = successor state to current state
$S^+$ = set of all successor states to current
    state, $R \in S^+$
$P(S)$ = steady state probability of $S$
$T_i(S)$ = throughput at $S$ due to
    completion of task $i \in |S|$
$T(S)$ = total throughput at $S$ = $\sum_{i \in |S|} T_i(S)$
$M(S)$ = mean residence time in $S$, = $1 / T(S)$
$b_R(S)$ = probability of branching from $S$
    to a new state $R$ due to completion of
    task $i$, $b_R(S) = T_i(S) / T(S)$
$p(R)$ = path probability to reach $R$
    = $\sum_{S \in R^-} p(S) \, b_R(S)$, where $R^-$ is the
    set of the predecessor states of $R$.
    Note that $\sum_{S \in S_\ell} p(S) = 1$
$D(R)$ = mean delay along path up to and
    including $R$
    = $\sum_{\text{all paths}} [D(S) \times b_R(S) + M(R) \times p(R)]$

Each entry in the Markov chain can be
represented by the following record:

[ $T_i(S)$, $i \in |S|$; $T(S)$; $M(S)$; $D(S)$; $P(S)$ ]

PROCEDURE: Performance Evaluation of a Task System

Step 0. Input parameters:
    $I$ (number of tasks)
    $[<\cdot]$ (precedence relationships among tasks)
    $K$ (number of devices in computer system)
    $X_{ik}$, $1 \le i \le I$, $1 \le k \le K$ (task loadings)
    Initialize dummy state $\emptyset$ at level
    zero ($S_0 = \emptyset$): $P(\emptyset) = 0$, $p(\emptyset) = 1$,
    $D(\emptyset) = 0$.

Step 1. Generate Markov chain:
    for levels $\ell$ = 0 to $L-1$ do
    for states $S \in S_\ell$ do
        Determine all successor states to $S$: $S^+$
        (taking into consideration precedence) and
        create new entries for these states at level
        $\ell+1$ and initialize them (unless previously
        created). $S_{\ell+1} = S_{\ell+1} \cup S^+$

        Compute branching probabilities from $S$
        to $R \in S^+$:
            $b_R(S) = 1$                  for $\ell = 0$
            $b_R(S) = T_i(S) / T(S)$      for $\ell > 0$
        where $i \in |S|$ and completion of $T_i$ leads
        from $S$ to $R$.

        for states $R \in S^+$ do
        - Compute throughput $T_i(R)$ for $i \in |R|$
        - Compute total throughput
          $T(R) = \sum_{i \in |R|} T_i(R)$
        - Compute mean time in state,
          $M(R) = 1 / T(R)$
        - Compute (partial) path probability,
          $p'(R) = p(S) \, b_R(S)$
        - Compute mean path delay
          $D(R) = D(R) + D(S) \times b_R(S) + M(R) \times p'(R)$
        - Update total path probability to state
          $R$, $p(R) = p(R) + p'(R)$
        - Compute (partial) state probability of
          new state,
            $P'(R) = 1$                        for $\ell = 0$
            $P'(R) = T_i(S) \, P(S) / T(R)$    for $\ell > 0$
        - Update total state probability
          $P(R) = P(R) + P'(R)$
        - Update initiation time of tasks newly
          activated in $R$,
          $I_i = I_i + D(S) \times b_R(S)$
        - Update completion time of task which
          executed in $S$ but is no longer
          executing in $R$,
          $C_i = C_i + D(S) \times b_R(S)$
        - Update normalization constant
          $NC = NC + P'(R)$

Step 2. Compute final results:
    - Compute task system completion time ($C$),
      for $S \in S_L$ do
          $C = C + D(S)$
    - Compute task i execution time ($E_i$),
      for $i$ = 1 to $I$ do
          $E_i = C_i - I_i$
    - Normalize state probabilities,
      for all $P(S)$ do
          $P(S) = P(S) / NC$

End of Procedure.


It should be noted that the mean completion 
time (C) and individual task execution time ($E_i$)
could also be computed using the alternate 
formulas and techniques in Section 3.2. 
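
The level-at-a-time idea of the procedure can be sketched compactly in Python. This is a simplified illustration under stated assumptions, not a transcription of the procedure: it tracks only path probabilities, path delays, and the task initiation and completion times (the steady state probabilities and NC are omitted), it identifies each state by its set of completed tasks, and the throughputs are supplied by a caller-provided function. The names `analyze`, `preds` and `throughput` are hypothetical.

```python
def analyze(preds, throughput):
    """Level-by-level evaluation of a task system given as a dag.

    preds:      dict task -> set of predecessor tasks
    throughput: function(frozenset of active tasks) -> dict task -> rate
    Returns (cycle time C, initiation times I, completion times, E).
    """
    tasks = frozenset(preds)

    def running(done):
        # Tasks start as soon as their precedence constraints are satisfied.
        return frozenset(t for t in tasks - done if preds[t] <= done)

    init = {t: 0.0 for t in tasks}    # mean initiation times I_i
    comp = {t: 0.0 for t in tasks}    # mean completion times C_i
    first = frozenset()
    # Each entry: completed set -> (path probability p, weighted delay D).
    level = {first: (1.0, 1.0 / sum(throughput(running(first)).values()))}
    for _ in range(len(tasks)):       # one task completes per level
        nxt = {}
        for done, (p, delay) in level.items():
            act = running(done)
            rates = throughput(act)
            total = sum(rates.values())
            for i in act:
                b = rates[i] / total              # branching probability
                new_done = done | {i}
                new_act = running(new_done)
                comp[i] += delay * b
                for j in new_act - act:           # newly activated tasks
                    init[j] += delay * b
                m = 1.0 / sum(throughput(new_act).values()) if new_act else 0.0
                p_old, d_old = nxt.get(new_done, (0.0, 0.0))
                part = p * b                      # partial path probability
                nxt[new_done] = (p_old + part, d_old + delay * b + m * part)
        level = nxt                   # previous level can now be discarded
    cycle = sum(d for _, d in level.values())
    exec_time = {t: comp[t] - init[t] for t in tasks}
    return cycle, init, comp, exec_time


# Two concurrent tasks, each completing at rate 1 regardless of contention
# (an artificial throughput function used only to exercise the sketch).
unit = lambda act: {t: 1.0 for t in act}
C, I, Cs, E = analyze({1: set(), 2: set()}, unit)
```

With the artificial unit-rate throughputs, the first completion takes 0.5 on average and the remaining task a further 1.0, so the sketch yields a cycle time of 1.5. Only one level of the chain is kept in memory at a time, which mirrors the space-saving remark in Section 4.1.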


4.3 Numerical Example and Validation 


The above procedure was used to solve the
task system of Figure 5 whose corresponding
Markov chain is given in Figure 6. The inner
model is a centralized server QN consisting of a
CPU ($1/\mu_c$ = 20 msec.) and two identical disks
($1/\mu_d$ = 40 msec.). Task loadings are specified
in Table 2.

Results obtained from the decomposition
method are compared against simulation results
in Table 3. The simulation was run for 8000
replications and 95% confidence intervals were
obtained with interval halfwidths of less than
3% of the estimated mean value for all values
presented. As can be seen, values obtained
using decomposition compare very favorably to
simulation results.


Table 2. Expected Total Loadings (milliseconds)
for Figure 5 Task System.


5. Conclusions 


The paper deals with the issue of analyzing 
the execution of task systems to determine the 
mean completion time and the mean initiation 
time of tasks. The only parameters required for 
this computation are the loadings of the tasks 
at the various units of the computer system. An 


approximation approach based on hierarchical
decomposition was developed to analyze the 
system. The computational procedure discussed 


in this paper can compute performance parameters 
while generating all possible (execution) states 
for the task system. The efficiency of this 
procedure is of essence for the feasibility of 
the method due to the large number of system 
states. Work is in progress in generalizing the 
procedure to handle the effect of scheduling, 


probabilistic task systems, etc. 


Table 3. Comparison of Decomposition Method Results and
Simulation Results for a Six-Task System.


Figure 6. Markov Chain for Figure 5 Task System. 
(Note: Only a representative set of throughputs shown.) 


REFERENCES 


1. P.A. Bernstein, et al., "Query processing in
a system for distributed databases (SDD-1),"
ACM Trans. Database Systems 6, 4 (Dec. 1981),
602-625.

2. W.W. Chu, et al., "Task allocation in
distributed data processing," IEEE Computer
13, 11 (Nov. 1980), 57-69.

3. P.J. Courtois, Decomposability: Queueing
and Computer System Applications, Academic
Press, 1977.

4. P. Heidelberger and K.S. Trivedi, "Queueing
network models for parallel processing with
asynchronous tasks," IEEE Trans. Computers
31, 11 (Nov. 1982), 1099-1109.

5. P. Heidelberger and K.S. Trivedi, "Analytic
queueing models for programs with internal
concurrency," IEEE Trans. Computers 32,
1 (Jan. 1983), 73-82.

6. S.S. Lavenberg (ed.), Computer Performance
Modelling Handbook, Academic Press, 1983.

7. D. Towsley, K.M. Chandy, and J.C. Browne,
"Models for parallel processing within
programs: Application to CPU:I/O and
I/O:I/O Overlap," Comm. ACM 21, 10 (Sept.
1978), 821-830.

8. K.S. Trivedi, Probability and Statistics
with Reliability, Queueing, and Computer
Science Applications, Prentice-Hall, 1982.

9. J. Zahorjan, "The approximate solution of
large queueing network models," Ph.D.
thesis, Technical Report CSRG-122, Computer
Systems Research Group, Univ. of Toronto,
August 1980.


Appendix I 


The aggregated state values are computed from the
Markov chain model as follows.
Solving pQ = 0 for the steady state
probabilities gives,

p = [0.0764 0.0519 0.0538 0.1841 0.0491 0.0913
0.1687 0.3247]

From these state probabilities, the
aggregated state probabilities can be obtained,

p(1,2) = p(1,1;0,0) + p(0,1;1,0) +
         p(1,0;0,1) + p(0,0;1,1)
       = 0.3662
p(1) = p(1,0;0,0) + p(0,0;1,0) = 0.1404
p(2) = p(0,1;0,0) + p(0,0;0,1) = 0.4934


Mean completion time of the task system is
computed by first obtaining the mean rate at
which the task system completes. This is given
by the total mean throughput from aggregated
states S(1) and S(2) to aggregated state S(1,2)
(see Figure 4). Taking the inverse of the total
mean throughput gives the task system mean
completion time in seconds,

$C = [\, p(1,0;0,0) \times \mu_c \times p_{c1c1} + p(0,1;0,0) \times \mu_c \times p_{c2c2} \,]^{-1} = 0.850$


The throughputs between aggregated states
(in jobs/second) are computed by obtaining the
mean total transition rate between each source
and destination aggregated state, conditioned by
the probability of being in the source
aggregated state. This gives the following set
of equations (see Figure 4),

$T_1[1,2] = [\, p(1,1;0,0) \times 0.5 \times \mu_c \times p_{c1c1} + p(0,1;1,0) \times \mu_c \times p_{c1c1} \,] / p(1,2) = 2.209$

$T_2[1,2] = [\, p(1,1;0,0) \times 0.5 \times \mu_c \times p_{c2c2} + p(1,0;0,1) \times \mu_c \times p_{c2c2} \,] / p(1,2) = 1.050$

$T_1[1] = p(1,0;0,0) \times \mu_c \times p_{c1c1} / p(1) = 3.125$

where p(1,2), p(1) and p(2) are computed above.


Lastly, task 1 and task 2 mean execution
times, $E_1$ and $E_2$ respectively, are computed
using equation (3.4) in Section 3.2 and the
above values.


On the Performance of Interleaved Memories 
with Non-uniform Access Probabilities* 
ABSTRACT-- System structure and program
behavior are two major factors that influence the per- 
formance of a tightly-coupled multiprocessor. The latter 
has been usually ignored in most of the previous stu- 
dies. In this paper, we study the performance of a 
tightly-coupled multiprocessor in which a crossbar is 
employed to interconnect p processors to m memory 
modules. A set of non-uniformly distributed probabili- 
ties is also employed to illustrate the program behavior, 
but no distinction is made between processors. An 
inverse relation between the average request comple- 
tion time and the effective memory bandwidth is 
obtained and three approximation methods are proposed.
Their solutions are compared with the exact solution.
Among them, the Repetitive Augmenting Method,
which is based on the idea of aggregation, generates the
best result.


1. Introduction 


There is a well known mismatch between processor 
and memory speeds in computer systems. Memory 
speed is about one order of magnitude slower than pro- 
cessor speed. One technique being widely employed to 
solve this problem is to provide some parallelism for 
memory accessing by partitioning the memory into a 
number of modules; this is the so-called interleaved
memory scheme.


In a tightly-coupled multiprocessor system, an m- 
way interleaved memory is shared by p processors. A 
processor can access each memory module via some 
processor-memory switch (interconnection network). In 
the past, a crossbar switch has been widely used to 
interconnect various combinations of computer subsystems,
including processors to memories. A crossbar
switch allows a processor to access any memory module 
as long as it is the only one trying to access the memory 
module. Therefore, several requests can be satisfied if 
they access different memory modules, but only one of 
these requests to a given memory module can be 
satisfied. 


However, a crossbar switch suffers from the fact 
that it requires O(m.p) switching components to inter- 
connect p processors to m memory modules. The 
hardware cost can be enormous for large m and p. For 
this reason, a whole range of ingenious interconnection 
networks have been proposed, including the following:
Banyan network, Omega network, Delta network, Baseline
network, Data Manipulator network and Augmented
Data Manipulator network. By no means is the above
list exhaustive. For a good survey, see [7] and [16].


While most of the recently proposed interconnection
networks require only O(n log n) switching components
to interconnect n processors to n memory
modules, they do not preserve the bandwidth of a
crossbar switch [11]. Mudge and Makrucki [10] pointed
out that, in the context of VLSI technology, reduced 
component complexity may be no longer an advantage 
within a single IC. For example, they compared the lay- 
outs of a Delta network and a bit-slice crossbar and 
found that the reduced component complexity does not 
appear to translate into more efficient space utilization 
in an IC layout [9]. 


Several investigators have studied the performance
of a crossbar switch for tightly-coupled multiprocessors
([2], [8], [4], [12] [18] and [14]). The results
reported by these investigators were obtained by applying
either a simplified approximate mathematical
model or some Markov chain model to derive the
effective memory bandwidth, that is, the average
number of requests that can be satisfied in one memory
cycle time.


The effective memory bandwidth depends not only
on the system structure but is also highly
dependent on the distribution of all requests which are
outstanding at a given time. This transient distribution
of requests represents the program behavior. Most
studies in the past have ignored this major factor by
assuming all modules have an equal probability of being
accessed by any given request. In [12] a trace driven
simulation technique was used to take program
behavior into consideration. In the same paper Rau has
also shown that the commonly used uniform access distribution
is not valid. Some modules are indeed referenced
more often than others.


Even though it is true that, in the long term, the
probabilities of a request accessing different modules 
are uniformly distributed (as assumed by most models), 
the memory bandwidth at a given time is determined by 
all outstanding requests at that time. Thus, if we parti- 
tion the whole observation period into several sub- 
periods and are able to determine the program 
behavior in each subperiod, a more precise memory 
bandwidth can be obtained by averaging the effective 
memory bandwidths in all subperiods. 


For the above reasons, in this paper we study the 
performance of a tightly-coupled multiprocessor using a 
crossbar to interconnect p processors to m memory 
modules and assuming a set of non-uniformly distri- 
buted probabilities to illustrate program behavior. Let 
P(i) denote the probability that a request is directed to 
the ith module. It is assumed that P(i) is not necessarily 
equal to P(j) for i#j. From the memory bandwidth 
viewpoint, the program behavior of a multiprocessing 
system can be characterized by a set of P(i)'s. In this
study, no distinction is made between processors.


In the following sections, a simple model for a mul- 
tiprocessor system is first described. Then one exact 
solution and three approximation solutions for finding 
the performance of the proposed model are presented. 


An inverse relation between the average request com- 
pletion time and the average memory bandwidth is also 
obtained. 


2. System Model 


In this paper, all analyses are based on a model 
which is similar to the one proposed by Rau [13] but 
which has a set of probabilities with non-uniform distri- 
butions to illustrate the program behavior. The model 
is shown in figure 2.1. The following assumptions are 
made: 

1) There are p independent processors (labeled as P_1,
P_2, ..., P_p) and an m-way interleaved memory (labeled
as M_1, M_2, ..., M_m) in the system, where p is not
necessarily equal to m.

2) The p processors and m memory modules are con- 
nected by either a cross-bar switch (hence each proces- 
sor can access any of the m modules) or a switch sys- 
tem which guarantees that a memory request issued by 
a processor can be satisfied if it is the only request 
directed to the desired memory module. 

3) The m memory modules are synchronized, i.e., they 
start a memory cycle at the same time and have an 
identical memory cycle time. 

4) At the beginning of each memory cycle, each processor
generates a request. If a processor's previous
request has not been satisfied, the processor will
generate the same request again.

5) Each memory module can serve one and only one 
request during a memory cycle time even though there 
may be other requests directed to it. 

6) There is a memory conflict resolution method which 
chooses one request to service by following some rule if 
more than one request references the same memory 
module.

7) Each request made by a processor has probability P(i)
of accessing the ith memory module.


Figure 2.1. A Multiprocessor System With p Processors and m Memory Modules.


Sethi and Deo have studied the performance of a
multiprocessor system with non-uniform access proba- 
bilities [15]. However, they assumed that the memory 
in a multiprocessor system is partitioned into modules 
by the higher order bits of the address and the program 
behavior is illustrated by a processor having probability 
a to access the same memory module as its previous 
request and probability (1-a)/(m-1) to access a 
different module. Since no distinction is made between 
processors and the program behavior is illustrated by a
set of probabilities, our model fits the cases where 
several processors execute the same programs on some 
shared data and the memory is partitioned into memory 
modules by the lower order bits of the address. One
such example was presented in [1]: the batch binary
search in a multiprocessing environment of p processors
and m interleaved memory modules. An ordered
table of n elements is stored in memory in such a way
that element X_i is stored in module M_j with j = (i mod
m)+1. Each processor, when free, takes a request from
a queue, i.e., a key K, and performs a binary search 
algorithm on the table. Since some elements are refer- 
enced more often than others (for example, every query 
needs to access the "middle" element), the probabilities 
of accessing the various memory modules are different. 
The above example can be generalized to that of a 
directory of some file (or database) stored on a common 
interleaved memory. Each user, associated with a pro- 
cessor, searches the directory to respond to a given 
query. Since no distinction is made between the types 
of queries asked by all users, there is no distinction 
between processors. 
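The batch binary search example can be illustrated with a short sketch. It is not taken from [1]; it assumes that every key in the table is searched for equally often and that only successful searches occur, and the function name `binary_search_module_probs` is hypothetical. It counts how often each module is probed and normalizes the counts into a set of P(i)'s.

```python
from collections import Counter

def binary_search_module_probs(n, m):
    """Estimate P(1..m) for binary searches over a table X_1..X_n in which
    element X_i is stored in module ((i mod m) + 1).

    Assumption (not from the paper): each of the n keys is looked up once,
    i.e. keys are equally likely and every search succeeds.
    """
    hits = Counter()
    for key in range(1, n + 1):
        lo, hi = 1, n
        while lo <= hi:
            mid = (lo + hi) // 2
            hits[(mid % m) + 1] += 1      # module holding the probed element
            if mid == key:
                break
            if mid < key:
                lo = mid + 1
            else:
                hi = mid - 1
    total = sum(hits.values())
    return [hits[j] / total for j in range(1, m + 1)]

if __name__ == "__main__":
    # With n = 1023 and m = 8 the "middle" elements all fall in module 1,
    # so the resulting distribution is far from uniform.
    print([round(x, 3) for x in binary_search_module_probs(1023, 8)])
```

Under these assumptions the module holding the middle elements receives most of the references, which is the kind of non-uniform P(i) set the model is meant to capture.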


3. Exact Solution 


In this section we adapt the analysis of [2] and [3] 
for the uniform case to the non-uniform probability 
model. The complete set of states of the model
described above can be defined as the m-tuples (k_1, k_2,
..., k_m), where k_i is the number of requests directed to
the ith module at a given time. Since there are p processors
in the system and each holds one request at a
time, k_1 + k_2 + ... + k_m = p. The number of such states
K = (k_1, k_2, ..., k_m) is the number of ways to partition p
processors into m memory modules: C(p+m-1, m-1) [6]
(C(i,j) is the number of ways to choose j elements from a
pool of i elements). The number NZ(K) of non-zero
integers in k_1, k_2, ..., k_m is the number of requests
currently being serviced in the state K = (k_1, k_2, ..., k_m).
This model results in a Markov Chain as shown in figure 
3.1, since the choice of the next state is only affected 
by the current state. Each memory module has its own 
queue and each request has a probability P(i) of being
directed to the ith memory module. If more than one
request references the same memory module, one of 
them is chosen for service by some memory conflict 
resolution method and all others remain in the queue 
(regenerated at the next memory cycle) for future 
memory cycles. 


Figure 3.1. A Markov Chain Model.


It is assumed that each P(i) ≠ 0 for 1 ≤ i ≤ m, since in
the cases where P(i)=0 for some i, there is no request to
reference the ith module and this can be viewed as if
there were only m-1 memory modules in the system. It
was pointed out in [2] that in addition to being a Markov
Chain, the system is aperiodic, since a transition from
any state to itself is possible in one step, and the system
is irreducible, since any state can reach any other state
in a finite number of steps. Let f^{(i)}_{KK} denote the
probability that the system first returns to its initial
state K after exactly i transitions. It is not hard to see
that all states in the system are ergodic, since

    \sum_{i=1}^{\infty} f^{(i)}_{KK} = 1

for any state K. Thus, there exists a unique stationary
distribution for all states of the system [8]. This means
that there exists a unique solution for all Prob(K)'s,
where Prob(K) is the probability of the system being in
state K.

We also define a temporary state K' as (k_1', k_2', ...,
k_m'), where k_i' = k_i - 1 if k_i ≥ 1 or k_i' = k_i if k_i = 0. The
temporary state K' represents the state of the system
at the end of the current memory cycle and before each
"satisfied" processor generates a new request. By
definition, it is easy to see that k_1 + k_2 + ... + k_m = k_1' +
k_2' + ... + k_m' + NZ(K) = p and 1 ≤ NZ(K) ≤ m. It is also
true that NZ(K) ≥ NZ(K'), since k_i' ≤ k_i for 1 ≤ i ≤ m.


Let K' = (k_1', k_2', ..., k_m') be the uniquely determined
temporary state of state K = (k_1, k_2, ..., k_m). At
the beginning of the very next memory cycle, there are
NZ(K) free processors and each generates a new
request. There are C(NZ(K)+m-1, m-1) ways to partition
NZ(K) new requests into m memory modules. Let
(d_1, d_2, ..., d_m) be one such possible result, where
NZ(K) = d_1 + d_2 + ... + d_m and d_i is the number of new
requests referencing the ith memory module. Let N =
(k_1'+d_1, k_2'+d_2, ..., k_m'+d_m) and P(K → N) be the
probability of having the system transfer from state K to state
N in one step (one memory cycle). Then

    P(K \to N) = C(NZ(K), d_1) P(1)^{d_1} \cdot C(NZ(K)-d_1, d_2) P(2)^{d_2} \cdots P(m)^{d_m}
               = \frac{NZ(K)!}{d_1! \, d_2! \cdots d_m!} P(1)^{d_1} P(2)^{d_2} \cdots P(m)^{d_m}

(where n! = n.(n-1)...2.1).
Thus we can compute the probability P(K → N) for
each possible state N which can be reached from state K
in one step (there are C(NZ(K)+m-1, m-1) such states).


Let Prob(N) be the probability of the system being
in state N. Then the following two conditions are
satisfied [8]:

1) For each state N, Prob(N) = \sum_{K} Prob(K) . P(K \to N).

2) \sum_{N} Prob(N) = 1.


Since there are C(p+m-1,m-1)+1 equations which 
satisfy either conditions (1) or (2) and C(p+m-1,m-1) 
unknowns, these equations can be solved. Gauss elimi- 
nation is one possible way to do so, although there are 
better programming methods as shown in [3]. 


Once all the Prob(K)'s are known, several important
performance factors can be calculated as follows:

1) the average memory bandwidth
    B = \sum_{K} Prob(K) . NZ(K),

2) the utilization factor of module i, i.e., the
probability of the ith module being busy
    = \sum_{K with k_i > 0} Prob(K = (k_1, k_2, ..., k_m)),

3) the average number of requests, including the one
currently being serviced, which are queued in the ith
memory module
    L_i = \sum_{K} Prob(K = (k_1, k_2, ..., k_m)) . k_i.
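The exact procedure of this section can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: instead of Gauss elimination or the methods of [3], it finds the stationary distribution by repeatedly applying the one-step transition probabilities (power iteration), which converges because the chain is irreducible and aperiodic. The names `compositions`, `nz`, `transition_prob` and `exact_bandwidth` are hypothetical.

```python
from math import factorial

def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def nz(state):
    """NZ(K): number of non-zero entries, i.e. requests served in this cycle."""
    return sum(1 for k in state if k > 0)

def transition_prob(d, P):
    """Multinomial probability that the new requests split as d = (d_1..d_m)."""
    coeff = factorial(sum(d))
    prob = 1.0
    for d_i, p_i in zip(d, P):
        coeff //= factorial(d_i)
        prob *= p_i ** d_i
    return coeff * prob

def exact_bandwidth(P, p, tol=1e-12, max_iter=100000):
    """Average bandwidth B = sum_K Prob(K) * NZ(K) for p processors and
    access probabilities P. Power iteration is a substitute chosen for this
    sketch, not the solution method used in the paper."""
    m = len(P)
    states = list(compositions(p, m))
    prob = {K: 1.0 / len(states) for K in states}
    for _ in range(max_iter):
        nxt = dict.fromkeys(states, 0.0)
        for K, pk in prob.items():
            K_tmp = tuple(k - 1 if k > 0 else 0 for k in K)   # temporary state K'
            for d in compositions(nz(K), m):                  # new requests
                N = tuple(a + b for a, b in zip(K_tmp, d))
                nxt[N] += pk * transition_prob(d, P)
        change = max(abs(nxt[K] - prob[K]) for K in states)
        prob = nxt
        if change < tol:
            break
    return sum(pk * nz(K) for K, pk in prob.items())

if __name__ == "__main__":
    print(round(exact_bandwidth([0.5, 0.5], 2), 4))
    print(round(exact_bandwidth([0.75, 0.25], 2), 4))
```

For m=p=2 the chain has only three states and can be solved by hand, giving B = 1.5 for P = (0.5, 0.5) and B = 1.3 for P = (0.75, 0.25), which makes a convenient sanity check for the sketch.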
Although the number of possible states C(p+m-1, m-1)
= p+1 in the case where m=2, the number of possible
states in general is very large. For instance,
C(p+m-1, m-1) = 6,435 for m=p=8 and 300,540,195 for
m=p=16. Baskett and Smith pointed out in [2] that
C(p+m-1, m-1) grows as fast as 4^p for p=m. Therefore,
the procedure to find the exact solution for a system
is very time consuming, and this method is unrealistic
for a system with medium to large p and m. We
therefore need to explore some approximation solutions.
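The growth of the state space is easy to check numerically. The following minimal sketch uses the standard-library function `math.comb`; the helper name `num_states` is hypothetical.

```python
from math import comb

def num_states(p, m):
    """Number of states K = (k_1..k_m) with k_1 + ... + k_m = p."""
    return comb(p + m - 1, m - 1)

if __name__ == "__main__":
    for size in (2, 4, 8, 16):
        print(size, num_states(size, size))
```

The counts for m=p=8 and m=p=16 reproduce the figures quoted above, and the m=2 case reduces to p+1.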


4. Memory Bandwidth and Request Completion Time


A request generated by a processor takes one
memory cycle time to be serviced and spends zero or
more cycle times waiting for service. Let us call the
total time that a request stays in the system (the sum
of the service time and the waiting time) the request
completion time. It is obvious that the request completion
time is an important metric of system performance.
Let T_i denote the request completion time for a
request directed to the ith memory module and L_i be
the average number of requests queued in the ith
memory module, including the one being serviced, if
any. It is easy to see that L_1 + L_2 + ... + L_m = p. Let

    T = \sum_{i=1}^{m} P(i) . T_i

be the average request completion time.

By the definition of memory bandwidth, there are
on the average B (memory bandwidth) requests being
satisfied during one memory cycle. That is, on the average
B new requests are issued at the beginning of each
memory cycle. Therefore, the arrival rate for memory
module M_i is B.P(i).

By Little's Law, the number of requests in queue

= (arrival rate) x (the average time that a customer
stays enqueued).

Applying Little's law to our model yields

    \sum_{i=1}^{m} B . P(i) . T_i = \sum_{i=1}^{m} L_i = p

and

    B = p / (\sum_{i=1}^{m} P(i) . T_i) = p / T        (1)


By equation (1), there is an inverse relation
between memory bandwidth and request completion
time. Two approximation methods are proposed in the
next section. One of them is derived from this inverse
relation.


5. Two Simple Approximation Methods 


We first consider an approximation method which is 
a generalization of the approximation method presented 
in previous papers ([2], [4] and [17]). In our model, 
each processor with its previous request unsatisfied will 
regenerate the same request at the next memory cycle. 
Thus, at the beginning of the next memory cycle, the 
number of newly generated requests may be less than p. 
If it is assumed that each processor, independently of 
whether its previous request is satisfied or not, gen- 
erates a new request at each memory cycle, it can be 
shown [18] that the average memory bandwidth B is 
given by : 
    B = m - \sum_{i=1}^{m} (1 - P(i))^p        (2)


Since 1-P(i) is the probability for one request not to
access the i-th memory module and (1-P(i))^p is the
probability for all p requests not to access the i-th
memory module, m - \sum_{i=1}^{m} (1-P(i))^p is the expected
number of busy memory modules (i.e., modules to which
at least one request among the p possible ones is
directed). When P(i) = 1/m for 1 ≤ i ≤ m, B = m - m(1 - 1/m)^p.
This is consistent with the result presented in [2], [4]
and [17]. It is not hard to see that equation (2) has a
maximum of m - m.(1 - 1/m)^p, occurring at
P(1)=P(2)= ... =P(m)= 1/m, and a minimum of 1 at P(j)=1
for some j and P(i)=0 for all other i ≠ j.
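Equation (2) translates directly into code. The sketch below is a straightforward transcription; the function name `am1_bandwidth` is hypothetical.

```python
def am1_bandwidth(P, p):
    """Equation (2): B = m - sum_i (1 - P(i))**p, where m = len(P).

    P is the list of access probabilities P(1..m); p is the number of
    processors, each assumed to issue a fresh request every cycle.
    """
    m = len(P)
    return m - sum((1.0 - q) ** p for q in P)

if __name__ == "__main__":
    print(round(am1_bandwidth([0.75, 0.25], 2), 4))   # m = p = 2
    print(round(am1_bandwidth([0.25] * 4, 4), 4))     # uniform case, m = p = 4
```

For the uniform case with m=p=4 this evaluates 4 - 4(0.75)^4, the value of m - m(1 - 1/m)^p.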

Since the queueing behavior of unsatisfied requests
is ignored, the above result is optimistic, i.e., will yield
larger B's than the exact solution. In the rest of this 
section, we derive another approximation method by 
utilizing the inverse relation (as shown in equation (1) of 
section 4) between the average memory bandwidth and 
the average request completion time. 

In our second approximation method we isolate a 
processor whose request has not been satisfied. Let us 
assume that its request was for the ith memory module. 
Then at the next memory cycle it will again access M_i
while the remaining (p-1) processors generate new
independent requests. A memory conflict resolution
method, which randomly chooses any one of the
requests directed to a given module, is also assumed.
Note that different conflict resolution methods may
cause little difference in the overall performance [4].

Let Q_i denote the probability that a request directed
to the ith module is not satisfied at a given
memory cycle.

The probability of having j-1 (j ≥ 1) other requests
directed to the same (ith) module is:

    C(p-1, j-1) . P(i)^{j-1} . (1-P(i))^{p-j}

Thus,

    Q_i = \sum_{j=1}^{p} ((j-1)/j) . C(p-1, j-1) . P(i)^{j-1} . (1-P(i))^{p-j},

where the factor (j-1)/j indicates that one out of j
requests will be satisfied. Thus, the probability of a
request directed to the ith module being satisfied at a
given memory cycle is:

    1 - Q_i = \sum_{j=1}^{p} (1/j) . C(p-1, j-1) . P(i)^{j-1} . (1-P(i))^{p-j}
            = (1/(p . P(i))) . \sum_{j=1}^{p} C(p, j) . P(i)^{j} . (1-P(i))^{p-j}
            = (1/(p . P(i))) . (1 - (1-P(i))^p)

The average completion time for a request directed to
the ith module, T_i, is

    T_i = \sum_{k=1}^{\infty} k . (1-Q_i) . Q_i^{k-1}
        = (1-Q_i) . \sum_{k=1}^{\infty} k . Q_i^{k-1}
        = (1-Q_i) . 1/(1-Q_i)^2
        = 1/(1-Q_i) = p . P(i) / (1 - (1-P(i))^p)

The average request completion time T is then:

    T = \sum_{i=1}^{m} P(i) . T_i = p . \sum_{i=1}^{m} P(i)^2 / (1 - (1-P(i))^p)

By equation (1),

    B = p/T = 1 / \sum_{i=1}^{m} ( P(i)^2 / (1 - (1-P(i))^p) )        (3)
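Equation (3) and the intermediate completion times can be evaluated in the same way. This is a minimal sketch; the function names are hypothetical, and modules with P(i)=0 are skipped on the assumption, stated in section 3, that they can be treated as absent.

```python
def completion_times(P, p):
    """T_i = p * P(i) / (1 - (1 - P(i))**p) for every module with P(i) > 0."""
    return [p * q / (1.0 - (1.0 - q) ** p) for q in P if q > 0]

def am2_bandwidth(P, p):
    """Equation (3): B = 1 / sum_i ( P(i)**2 / (1 - (1 - P(i))**p) )."""
    return 1.0 / sum(q * q / (1.0 - (1.0 - q) ** p) for q in P if q > 0)

if __name__ == "__main__":
    P, p = [0.75, 0.25], 2
    T = sum(q * t for q, t in zip(P, completion_times(P, p)))
    # Equation (1) says B = p / T, so both expressions should agree.
    print(round(am2_bandwidth(P, p), 4), round(p / T, 4))
```

The usage lines compute B both from equation (3) and from the inverse relation B = p/T of equation (1), which gives a quick consistency check between the two.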


This result is in general still optimistic but more
accurate than the previous one as shown in Tables 5.1,
5.2 and 5.3 (see below). It could still be improved if we
could find a more precise way to compute T_i.


For convenience, in the following discussion, the 
two approximation methods to compute the average 
memory bandwidth according to equations (2) and (3) 
are called Approximation Method 1 and 2 respectively
(AM1 and AM2). Tables 5.1, 5.2 and 5.3 compare the
exact solution of the average memory bandwidth
(EXACT) and the two approximation solutions (AM1 and
AM2) calculated by equations (2) and (3) when m=p=2, 4
and 6 respectively. There are 15 cases in each table and
each row represents a case being considered. RD_i, for
i=1 or 2, is the relative difference between EXACT and
AMi and is defined to be 100.|EXACT-AMi|/EXACT. Since
it takes a prohibitive amount of time to compute the exact
solution for large m and p, we did the comparisons only for
small m and p.


By comparing the results shown in Tables 5.1, 5.2
and 5.3, we can observe that:
1) In all cases, both AM1 and AM2 are optimistic. However,
AM2 is closer to EXACT than AM1.
2) For a fixed m and p, both AM1 and AM2 are closer to
EXACT (with smaller RD_i) when the probability distribution
is more uniform (case (8) in Tables 5.1, 5.2 and 5.3).
3) As m and p increase, in some cases RD_i may get
fairly large. For instance, RD_1=50.11 and RD_2=32.50 in
case (2) of Table 5.3.


Because of the disturbing feature of the last point,
a new approximation method which takes polynomial
time and yields better results when the exact solution is
not feasible is proposed in the next section.


Table 5.1


       P(1)  P(2)   EXACT    AM1     RD1     AM2     RD2
(1)    0.89  0.11   1.1217  1.1958   6.61   1.1628   3.66
(2)    0.75  0.25   1.3000  1.3750   5.77   1.3462   3.55
(3)    0.68  0.32   1.3854  1.4352   3.60   1.4172   2.30
(4)    0.63  0.37   1.4367  1.4662   2.05   1.4540   1.34
(5)    0.59  0.41   1.4686  1.4838   1.03   1.4786   0.68
(6)    0.54  0.46   1.4936  1.4968   0.21   1.4957   0.14
(7)    0.45  0.55   1.4901  1.4950   0.33   1.4934   0.22
(8)    0.50  0.50   1.5000  1.5000   0.00   1.5000   0.00
(9)    0.43  0.57   1.4808  1.4902   0.63   1.4870   0.42
(10)   0.35  0.65   1.4174  1.4550   2.65   1.4417   1.71
(11)   0.29  0.71   1.3501  1.4118   4.57   1.3889   2.87
(12)   0.26  0.74   1.3127  1.3848   5.49   1.3574   3.41
(13)   0.17  0.83   1.1966  1.2822   7.15   1.2464   4.16
(14)   0.14  0.86   1.1586  1.2408   7.09   1.2053   4.03
(15)   0.98  0.02   1.0204  1.0392   1.84   1.0300   0.94


Memory Bandwidths Derived From Exact Solution and Two Approximation
Solutions for m=p=2.


Table 5.2

       P(1)  P(2)  P(3)  P(4)   EXACT    AM1     RD1     AM2     RD2
(1)    0.88  0.04  0.04  0.04   1.1364  1.4518  27.75   1.2400   9.12
(2)    0.67  0.16  0.11  0.06   1.4921  2.0821  39.54   1.8045  20.94
(3)    0.54  0.17  0.13  0.16   1.8420  2.4099  30.83   2.2182  20.42
(4)    0.46  0.24  0.18  0.12   2.1113  2.5295  19.82   2.4224  14.75
(5)    0.33  0.41  0.06  0.20   2.2083  2.4870  12.62   2.4282   9.96
(6)    0.31  0.23  0.28  0.18   2.5416  2.7009   6.27   2.6868   5.71
(7)    0.28  0.15  0.35  0.22   2.4498  2.6606   8.60   2.6299   7.35
(8)    0.25  0.25  0.25  0.25   2.6270  2.7344   4.09   2.7344   4.09
(9)    0.22  0.19  0.13  0.46   2.1154  2.5415  20.14   2.4327  15.00
(10)   0.18  0.22  0.30  0.30   2.5345  2.6975   6.43   2.6820   5.82
(11)   0.16  0.29  0.40  0.15   2.2965  2.5764  13.06   2.5340  10.34
(12)   0.13  0.21  0.15  0.51   1.9398  2.4579  26.71   2.2994  18.54
(13)   0.08  0.38  0.16  0.38   2.2065  2.4902  12.86   2.4247   9.89
(14)   0.06  0.24  0.57  0.13   1.7457  2.2785  30.52   2.0884  19.63
(15)   0.05  0.35  0.45  0.15   2.0705  2.3938  15.61   2.3118  11.65

Memory Bandwidths Derived From Exact Solution and Two Approximation
Solutions for m=p=4.
Table 5.3

       P(1)  P(2)  P(3)  P(4)  P(5)  P(6)   EXACT    AM1     RD1     AM2     RD2
(1)    0.19  0.30  0.12  0.13  0.15  0.11   3.1892  3.8278  20.02   3.7154  16.47
(2)    0.47  0.04  0.02  0.10  0.19  0.18   2.1260  3.1914  50.11   2.8170  32.50
(3)    0.16  0.21  0.15  0.11  0.09  0.28   3.2950  3.8243  16.06   3.7350  13.35
(4)    0.26  0.13  0.21  0.14  0.19  0.07   3.3874  3.8251  12.92   3.7581  10.94
(5)    0.09  0.11  0.20  0.25  0.22  0.13   3.4079  3.8362  12.57   3.7686  10.58
(6)    0.11  0.14  0.09  0.18  0.28  0.20   3.3000  3.8251  15.91   3.7373  13.25
(7)    0.21  0.07  0.11  0.15  0.32  0.14   3.0207  3.7324  23.50   3.5858  18.71
(8)    0.16  0.17  0.16  0.17  0.17  0.17   3.7781  3.9896   5.60   3.9891   5.58
(9)    0.27  0.19  0.13  0.10  0.14  0.17   3.3886  3.8697  14.20   3.7987  12.10
(10)   0.10  0.11  0.13  0.17  0.19  0.30   3.1809  3.8109  19.81   3.6971  16.23
(11)   0.21  0.14  0.23  0.11  0.25  0.06   3.3511  3.7791  12.77   3.7078  10.64
(12)   0.13  0.15  0.22  0.14  0.17  0.19   3.6552  3.9501   8.07   3.9278   7.46
(13)   0.31  0.33  0.11  0.13  0.08  0.04   2.7677  3.4819  25.80   3.2790  16.47
(14)   0.10  0.21  0.13  0.12  0.18  0.26   3.4096  3.8592  13.19   3.7913  11.19
(15)   0.16  0.12  0.15  0.12  0.13  0.32   3.0523  3.8103  24.83   3.6652  20.08

Memory Bandwidths Derived From Exact Solution and Two Approximation Solutions
for m=p=6.
6. Repetitive Augmenting Approximation Method
(RAAM)


The next approximation method is based on the 
idea of aggregation. A system of i memory modules can 
be viewed as the union of two subsystems as shown in 
figure 6.1. One subsystem consists of only one memory 
module (say the ith module) and the other consists of 
all the remaining memory modules. For convenience, 
we denote the subsystem with the single memory 
module i as subsystem (a) and the other consisting of 
modules 1,2, ...,i-1 as subsystem (b). Assume that we 
know how to derive the performance behavior of the 
aggregated system if the performance behavior of sub- 
system (b) is known (the detail will be shown later). 
Given a system of m memory modules and p processors, 
the algorithm for the Repetitive Augmenting Approxima- 
tion Method proceeds by starting with a subsystem of 1 
memory module (say module 1) and then repetitively
augmenting the number of memory modules being
considered in the sequence 2, 3, ..., m.


[Figure 6.1: the aggregated system of i memory modules viewed as the
union of subsystem (a), accessed with probability P(i)/S_i, and
subsystem (b).]


In this paper, we shall assume that the behavior of
a subsystem can be approximately represented by the
probabilities PRO(q,n,s). PRO(q,n,s) denotes the
probability for the subsystem of having s requests
satisfied at one memory cycle while at the beginning of
that memory cycle there were n new requests directed
to it and q requests originally queued in it. Note that
q+n is the total number of requests in the subsystem.
In the following, we will first discuss how to initialize
those PRO(q,n,s)'s for a subsystem of 1 memory module.
Then we show how to derive the performance behavior of a
system with i memory modules if those PRO(q,n,s)'s for
a subsystem of i-1 modules are known.


In a subsystem of one memory module, the probability
of having one request satisfied (s=1) while at least
one processor was in it (i.e., q+n ≥ 1) is 1. Thus, we
initialize all PRO(q,n,s) with q+n ≥ 1 and s=1 to be 1 and all
others to be 0.


A system of i memory modules and p processors, as
shown in figure 6.1, can be viewed as the union of
subsystems (a) and (b), where subsystem (a) consists of
one memory module and subsystem (b) consists of i-1
memory modules. Let P(i) denote the probability for a
request to access memory module i and

    S_i = \sum_{j=1}^{i} P(j).

Then for each request, the probability (P_a) of accessing
subsystem (a) is P(i)/S_i and that (P_b) of subsystem (b)
is S_{i-1}/S_i.
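The split between the two subsystems at aggregation step i can be computed as follows. This is a small sketch under the definitions above; the function name `access_split` and the use of a 0-based Python list for P(1), ..., P(m) are assumptions of the example.

```python
def access_split(P, i):
    """Return (P_a, P_b) for the aggregated system of modules 1..i.

    P is the list [P(1), ..., P(m)] (0-based in Python) and 2 <= i <= len(P).
    P_a = P(i) / S_i is the chance that a request goes to the single module
    of subsystem (a); P_b = S_{i-1} / S_i is the chance that it goes to
    subsystem (b).
    """
    s_prev = sum(P[:i - 1])      # S_{i-1}
    s_i = s_prev + P[i - 1]      # S_i
    return P[i - 1] / s_i, s_prev / s_i

if __name__ == "__main__":
    P = [0.1, 0.2, 0.3, 0.4]
    for i in range(2, len(P) + 1):
        print(i, access_split(P, i))
```

By construction P_a + P_b = 1 at every step, since requests outside modules 1..i are not part of the aggregated system being considered.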


Assume that all PRO(q,n,s)'s for subsystem (b) are
known, where q+n=j and 1 ≤ s ≤ min(i-1, j) (i-1 and j are
the numbers of memory modules and processors in
subsystem (b), and j can be an integer ranging from 1 to p,
the total number of processors in the whole system).
The set of possible states for the aggregated system is
defined as the set of 3-tuples N=(n_0, n_a, n_b), where n_0 is
the number of requests currently being serviced, and
n_a and n_b are the numbers of requests queued in
subsystems (a) and (b) respectively. For a possible state
N=(n_0, n_a, n_b), n_0 + n_a + n_b = j and n_0 ≤ i are necessary but
not sufficient conditions. (For instance, N=(1, n_a, n_b)
with n_a, n_b ≥ 1 and 1 + n_a + n_b = j is not a possible state but
satisfies these two conditions.) Therefore, the number
(NS) of possible states is bounded by C(j+2,2) =
(j+2).(j+1)/2 (the number of ways to partition j
elements into 3 sets).


Let Prob(N=(n_0, n_a, n_b)) be the probability of the
aggregated system being in state N and let PRN(q,n,s)
denote the probability of the aggregated system having
s requests satisfied at one memory cycle while at the
beginning of that memory cycle there are n new
requests directed to it and q requests queued in it.
Since the states and associated probabilities for
subsystem (b) are assumed to be known (i.e., all the
probabilities PRO(q,n,s) for 1 ≤ s ≤ i-1 and 1 ≤ q+n = j ≤ p), all
probabilities Prob(N=(n_0, n_a, n_b)) for the aggregated system
can be computed. Once all the probabilities Prob(N) for
the aggregated system and all the probabilities
PRO(q,n,s) for subsystem (b) are known, we can compute
all the probabilities PRN(q,n,s) for the aggregated
system. The PRN(q,n,s) then can be used for further
aggregation. When all m modules have been considered
the memory bandwidth B can be calculated by the
equation:

    B = \sum_{N} n_0 . Prob(N=(n_0, n_a, n_b)).


In the following, we will show how to compute
Prob(N)'s and PRN(q,n,s)'s. In the procedure
DISTRIBUTION(i,j) we compute the probability of being in
a "legal" state N=(n_0, n_a, n_b) for the current system of i
memory modules and j processors. Assume that among
the n_0 requests currently being serviced, d_a and d_b will
be directed to subsystem (a) and (b), respectively, at
the beginning of the next memory cycle (d_a+d_b=n_0).
Let N'=(0, n_a+d_a, n_b+d_b) denote the temporary state
which describes the number of requests which will be in
subsystems (a) and (b) for the next memory cycle. The
probability of the system transferring from state N
to state N' (P(N → N')) is (n_0!/(d_a! d_b!)) . P_a^{d_a} . P_b^{d_b} (recall
that P_a and P_b are the probabilities for a request to
reference subsystems (a) and (b) respectively). Assume
that k requests (k must be less than or equal to the
number of memory modules, i.e., i-1), among the n_b+d_b
requests which stayed in subsystem (b), are satisfied.
The next state will become the state N_2=(n'_0, n'_a, n'_b),
where n'_0=k+1, n'_a=n_a+d_a-1 and n'_b=n_b+d_b-k if
n_a+d_a ≥ 1, or n'_0=k, n'_a=0 and n'_b=n_b+d_b-k if n_a+d_a = 0.
The probability of transferring from state N' to state N_2,
P(N' → N_2), is PRO(q=n_b, n=d_b, s=k) as applied to
subsystem (b), and, writing N_1 for the current state N,

    P(N_1 \to N_2) = \sum_{N'} P(N_1 \to N') . P(N' \to N_2).
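The first half of a transition, in which the n_0 serviced requests are redistributed between the two subsystems, is a binomial split. The sketch below enumerates the temporary states reachable from one state together with their probabilities; the function name `temporary_states` is hypothetical, and the second half of the transition (which needs the PRO table of subsystem (b)) is deliberately left out.

```python
from math import comb

def temporary_states(state, P_a, P_b):
    """Map each temporary state N' = (0, n_a + d_a, n_b + d_b) reachable from
    state = (n_0, n_a, n_b) to its probability
    (n_0! / (d_a! d_b!)) * P_a**d_a * P_b**d_b, where d_a + d_b = n_0."""
    n0, na, nb = state
    result = {}
    for d_a in range(n0 + 1):
        d_b = n0 - d_a
        # comb(n0, d_a) equals n_0! / (d_a! d_b!)
        result[(0, na + d_a, nb + d_b)] = comb(n0, d_a) * P_a ** d_a * P_b ** d_b
    return result

if __name__ == "__main__":
    for tmp, pr in temporary_states((2, 0, 1), 0.5, 0.5).items():
        print(tmp, pr)
```

Because P_a + P_b = 1, the probabilities of the n_0 + 1 temporary states sum to one.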


Once all the one-step transition probabilities
between each pair of states are known, the probabilities
for the system to be in state N (Prob(N)) for all possible
3-tuples N can be computed by the same procedure as
the one for finding the exact solution (cf. section 3).
Details of the procedure DISTRIBUTION(i,j) can be found 
in [5]. 

In the procedure FINDPRN(i,j) we compute the
PRN(q,n,s)'s, i.e., the probabilities for the current
system of i memory modules and j processors of having s
requests satisfied while there were q requests enqueued
and n new requests generated. After the PRN(q,n,s)'s
have been obtained, we assign them to the corresponding
PRO(q,n,s)'s for future computations. Assume that the
system is in state N_1=(n_0, n_a, n_b) and there are d_a and
d_b requests, among the n_0 currently being satisfied,
being directed to subsystems (a) and (b) respectively.
Let us also assume that there are k requests, among all
requests in subsystem (b), being satisfied. Let
N'=(0, n_a+d_a, n_b+d_b) be the temporary state and
N_2=(n'_0, n'_a, n'_b), where n'_0=k+1, n'_a=n_a+d_a-1 and
n'_b=n_b+d_b-k if n_a+d_a ≥ 1, or n'_0=k, n'_a=0 and
n'_b=n_b+d_b-k if n_a+d_a=0. Then according to the
previous discussion

    P(N_1 \to N_2) = \sum_{N'} (n_0!/(d_a! d_b!)) . P_a^{d_a} . P_b^{d_b} . PRO(n_b, d_b, k).

This also means that
P(N_1 → N_2) is the probability for the system in state N_1
of having n'_0 requests satisfied while there were n_a+n_b
requests enqueued and n_0 new requests generated. Let
SP(q,n) be the sum of all Prob(N_1=(n_0, n_a, n_b)), where
q=n_a+n_b and n=n_0. Then it is not hard to see that

    PRN(q,n,s) = \sum_{N_1, N_2} P(N_1=(n_0, n_a, n_b) \to N_2=(n'_0, n'_a, n'_b)) . Prob(N_1) / SP(q,n),

where q=n_a+n_b, n=n_0 and s=n'_0. Since n_0 is always greater
than 0 in every possible state N=(n_0, n_a, n_b), the way of
deriving PRN(q,0,s) is somewhat awkward. It is basically
assumed that all n_0 requests are directed to somewhere
outside of the current system. Details can be found in
[5].

In RAAM, we start with a subsystem of one memory
module and then the number of memory modules being
considered (i) will be incremented by one until i=m-1.
In the system of i memory modules, the procedures
DISTRIBUTION(i,j) and FINDPRN(i,j) are executed for j
ranging from 1 to p because they are needed in future
computations.


After executing FINDPRN(m-1,p), all the probabilities
PRO(q,n,s) for the system of m-1 modules are
known. The probabilities Prob(N) for the whole system 
(with m memory modules and p processors) can be 
computed by DISTRIBUTION(m,p). Then the memory 
bandwidth B for the whole system is derived from those 
Prob(N). 

It is obvious that the insertion sequence of the 
memory modules into the system will affect the derived 
result. One suggested criterion is to label the m
memory modules in such a way that
P(1) ≤ P(2) ≤ ... ≤ P(m), since the later a memory
module is put into consideration, the more accurate
and influential it will be. The memory module with the
largest probability of being referenced should therefore be
"aggregated" last. This has been validated as shown in 
Table 6.1. In all cases where the probabilities were in 
ascending order the Repetitive Augmenting Approxima- 
tion Method had a better result (closer to exact solu- 
tion) than that of a descending or random order. Note 
that not only the average memory bandwidth for the 
whole system (with p processors and m memory 
modules) can be computed by RAAM, but also that of a 
sequence of subsystems. 

With regard to the complexity of the algorithm, the 
following observations can be made: 

1) In a system currently of i memory modules and j
processors the number (NS) of possible states is less than
C(j+2,2) = (j+2).(j+1)/2.

2) For a given state N=(n_0, n_a, n_b), there are
n_0 + 1 ≤ min(i,j) + 1 temporary states
N'=(0, n_a+d_a, n_b+d_b), where d_a+d_b=n_0. Each temporary
state N'=(0, n_a+d_a, n_b+d_b) can transfer to at most
min(i, n_b+d_b+1) ≤ min(i,j)+1 states in one step.

3) Once all the one-step transition probabilities are
known, Gauss elimination can be used to solve the NS
equations. It takes O(NS^3) time to do so. Therefore, the
time complexities for Procedures DISTRIBUTION(i,j) and
FINDPRN(i,j) are no more than O(NS.min(i,j)^2 + NS^3)
and O(NS.min(i,j)^2) respectively.

4) In addition to computing DISTRIBUTION(m,p), the
complete algorithm necessitates computing
DISTRIBUTION(i,j) and FINDPRN(i,j) for i=2 to m-1 and
j=1 to p. It therefore takes only polynomial time to find
the average memory bandwidth.

Tables 6.2 and 6.3 compare the exact solution with
all three approximation solutions. RAAM is the
approximation solution computed by the Repetitive Augmenting
Approximation Method and RD_R is the relative
difference between EXACT and RAAM.

The following conclusions can be reached:

1) Among all three approximation methods, RAAM has
the best result and AM2 is next.

2) In all cases, AM1 and AM2 are optimistic. However,
RAAM is more conservative.

3) Contrary to AM1 and AM2, RAAM has its worst case
when all reference probabilities are uniformly
distributed.


     P(1) P(2) P(3) P(4) P(5) P(6)   EXACT   RAAM   RD_R

(R)° 0.31°0523-9.28 0.18 2.5416 2.3040 9,35 
(O)o0232. 0228-023 0.58 2.5415 2.2680 10.76 
(A) 02164) ..23. 0928 0.32 2.5416 2.4182 4.86 
(R) 0.28 0.15 0.35 90.22 2.4498 2.3162 5.45 
(Dp) 0.35 0.23 0.22 0.15 Se 4h O 2e3l62. 10.95 
(A) 0215-0, 22.0.28- 68235 20en08 2.3709 3.22 
(R) 0.22 0.19 0.13 0.46 P1154 2.0770 1.82 
(D): 0246 0.22 0.390513 2 2Loe De 952° FP. 6a 
(A) LS Osho O222- 0.46 2.1154 2.0952 0.95 
{RR} 0.30 0.22 50.30 0.18 2.5345 2.3152 8.65 
(Dy 0.3.09 (0.30 02220212 2.5245 2.2571 10.94 
(A). 022.8 0.22 °0530-0.30 2.9345 2.4148 4.72 
(R) 0.16 0.29 0.40 0.15 2965 2.1836 4,92 
(D) 0.40 0229 D318. 0.15 222965. 2.0572> 2042 
(A) 0.15 0.16 0.29 0.40 2.2965 2.2607 1.56 
(R) 0.2) 0.14 0.23 0.11 0.25 0.06 Seoold. - 2.8792 14.08 
(D) 0.25 0.23 0221 0.34 0.12 0.06 34301) 266135 22.01 
(A): 0.506. 0.11. 0.24 0.25 0423: 0.25 3.3511 3.2443 3.19 
CR) 0.13: 0.15 0.22 0.24 0.27 06.29 3.6552 1783 13.05 
(D) 0.220.289 0217 0,15: 0.140.123 3.6552 2.872). 21.42 
(A) Os 005% 0.25 0.19 0.779: 0522 3.6552 3.3046 9.59 
(RY 05320012 0233 0.13: 0.02: 0.02 26967? 2.3329" 12.70 
(D): 03.33°0.21 O213°0.11..0208 0.04 2.7677 2.2494 18.73 
(AY 05-04-0706 O22) 0273. 0:31 0.33 2etb7Y 2.9512) (0.60 
CRY 05-20 O42) OF 02 OES 0426 3.4096 3.0602 25 
(Dp) 0.26 0.21 0.138 0.13 0.12 0.10 3.4086. 226510. 22595 
CA) 0620 OL. 0.12 0. bBo O22 0s.26 3.4096 3.2539 4.59 
(Py. Oot 0. TE? OTS: 0572: 0213 obs 82 3.0923. 2,520 9" “Ae 33 
(Dp) 0 O63 OES O53 0. Us Oe te S.0S 23> Peso Lis 
Ci) Oakes OOS 22D. LS Ooh. Oat Oi SOs DOES s2ans 

(R) : random order

(D) : descending order

(A) : ascending order

Table 6.1 Comparisons Between Ascending, Random and Descending Orders
for m=p=4 and m=p=6.


Table 6.2

       P(1)  P(2)  P(3)  P(4)   EXACT    AM1     RD1     AM2     RD2    RAAM    RD_R
(1)    0.04  0.04  0.04  0.88   1.1364  1.4518  27.75   1.2400   9.12   1.1364   0.00
(2)    0.06  0.11  0.16  0.67   1.4921  2.0821  39.54   1.8045  20.94   1.4919   0.01
(3)    0.13  0.16  0.17  0.54   1.8420  2.4099  30.83   2.2182  20.42   1.8365   0.30
(4)    0.12  0.18  0.24  0.46   2.1113  2.5295  19.82   2.4224  14.75   2.0939   0.81
(5)    0.06  0.20  0.33  0.41   2.2083  2.4870  12.62   2.4282   9.96   2.1908   0.79
(6)    0.18  0.23  0.28  0.31   2.5416  2.7009   6.27   2.6868   5.71   2.4182   4.86
(7)    0.15  0.22  0.28  0.35   2.4498  2.6606   8.60   2.6299   7.35   2.3709   3.22
(8)    0.25  0.25  0.25  0.25   2.6270  2.7344   4.09   2.7344   4.09   2.4012   8.59
(9)    0.13  0.19  0.22  0.46   2.1154  2.5415  20.14   2.4327  15.00   2.0952   0.95
(10)   0.18  0.22  0.30  0.30   2.5345  2.6975   6.43   2.6820   5.82   2.4148   4.72
(11)   0.15  0.16  0.29  0.40   2.2965  2.5764  13.06   2.5340  10.34   2.2607   1.56
(12)   0.13  0.15  0.21  0.51   1.9398  2.4579  26.71   2.2994  18.54   1.9322   0.39
(13)   0.08  0.16  0.38  0.38   2.2065  2.4902  12.86   2.4247   9.89   2.1878   0.85
(14)   0.06  0.13  0.24  0.57   1.7457  2.2785  30.52   2.0884  19.63   1.7447   0.06
(15)   0.05  0.15  0.35  0.45   2.0705  2.3938  15.61   2.3118  11.65   2.0648   0.28

Memory Bandwidths Derived From Exact Solution and Three Approximation
Solutions for m=p=4.
Table 6.3 
ECL) Pel) PCy Bay PS) Rie). RKReT AML RD1  AM2 RD2 RAAM RD. a 
(1) 6.11 0.12 6.13 6.15 6.19 @.38 3.1892 3.8278 20.02 3.7154 16.47 3.8862 3.23 
(2) 6.02 9.54 6.10 6.18 06.19 0.47 2.1260 3.1914 50.11 2.8178 32.50 2.1260 6.06 
C20: Gx Ge Sel Bela oe VOY OW 28" 19490503. 8049.46.06 9.7550 13.95; Sel G65 3.09 
(3) O.G7 0.13 0-14 6.19 9.21 0.26 3.3874 3.8251 12.92 3.7581 10.94 3.2525 3.98 
(5) 9.0) O.11 0.13 9.26 9.22 0.25 3.4079 3.8362 12.57 3.7686 10.58 :3.2674 4.68 
(6) O.09 O.11L G-14 9.18 0.20 0.28 3.3000 3.8251 15.91 3.7373 13.25 3.1924 3.26 
Py  Uior Crit Cele Bats Grek O32 330207 3. 7324 23:50 355958) 26.71 2.9752 .1-.51 
(8) 6.16 G.16 0.17 6.17 6.17 6.17 3.7781 3.9896 5.60 3.9891'5.58 3.1963 15.56 
(9) 9.16 0.13 0.14 0.17 06.19 0.27 3.3886 3.8697 14.20 3.7937 12.19 3.2246 4.84 
(15) 0.10 U.11 6.13 0.17 6.19 9.30 3.1899 3.8169 19.81 3.6971 16.23 3.09954 2.69 
(il) 0.86 O.11 Geld G521 0023 0.25 39,3571 3.7791 12.77 3.7878 10.64. 3.2443. 3.19 
(12) 0.13 6.14 0.15 0.17 6.19 9.22 3.6552 3.9591 8.07 3.9278 7.46 3.30846 9.59 
Lay O06 OO 08 06: Og. BiG 0.33 92.9677 344819. 28.80 3.7790. 16.047 2%. 1512 0.60 
(14) O.15 6.12 0.13 6.18 G.21 Y.26 3.4896 3.8592 13.19 3.7913 11.19 3.2539 4.59 
(15) 0-12 3.12 9.13 6.15 6.16 6.32 3.6523 3.8193 24.83 3.6652 20.98 2.9713 2.65 
Memory Bandwidths Derived From Exact Solution and Three Approximation Solutions
7. Conclusion


The system structure and program behavior are 
two major factors which influence the performance of 
interleaved memories. The latter has usually been 
ignored in most previous studies. A simple model for a 
multiprocessor system with p processors and m 
memory modules was presented. In this model, a set of 
non-uniformly distributed probabilities P(i) is employed 
to illustrate the program behavior, but no distinction is 
made between processors. 


One exact solution and three approximation solu- 
tions were proposed in order to evaluate the perfor- 
mance of interleaved memory. Since it may take an 
unbearable time to compute the exact solution for 
medium to large m and p, there is a definite need to 
explore some approximation methods. Among all three 
approximation methods, RAAM gives the best result and 
AM2 is next. The approximation solutions computed by 
AM1 and AM2 are always optimistic. The results of 
RAAM, on the other hand, are always more conservative. 
An inverse relation between average memory bandwidth 
and average request completion time was also obtained. 
AM2 is derived from this inverse relation and the result 
of AM2 could be improved if a more precise way to cal- 
culate the average completion time could be found. 
ABSTRACT: A Markovian queueing network model is
developed for the performance analysis of multiproces- 
sor systems with multiple time-shared-bus interconnec- 
tion networks. The effect of memory and bus conten- 
tions is investigated, and the comparative results from a 
unibus to a crossbar system are presented. The results 
show that decreasing the number of busses in a 
crossbar switch by factor of two produces a negligible 
degradation on the system performance in most cases. 
We have obtained exact results by devising an algo- 
rithmic method to convert the Markov chain of the 
queueing model to a simple birth-death process. 


I. INTRODUCTION 


Before starting to design and implement a real sys- 
tem, it is necessary for a designer to estimate the per- 
formance of a proposed system for given values of input 
parameters by applying analytic or simulation methods 
to the mathematical model of the system. Analytic 
models are very useful for a designer because they allow 
one to explore the effects of variations of system design 
parameters quickly and rather economically compared 
with simulation models. 


Our study is concerned with devising an exact ana- 
lytic model for the performance evaluation of a typical 
multiprocessor system (Fig.1), where each processor 
has its local memory unit and the allocation of common 
resources is controlled by the controller unit. The 
dynamic structure of the Interconnection Network(IN) 
enables the system to reconfigure the links between 
processors and the common memory. One way to realize 
this is to use multiple time-shared busses. When a pro-
cessor requests access to the common memory, it sig-
nals the controller for a connection to the referenced
module. Requests for connections are assumed to be 
independent from one processor to another, and more 
than one processor can request access simultaneously. 


Since both the common memory and data paths are 
shared, contentions may arise, causing processors to 
queue for a resource which is currently in use. If every 
processor can be connected to a free memory module 
without blocking, the only cause of contention would be 
the common memory. If the referenced module is busy 
at the time of a request, then the controller puts the 
processor's request in a FCFS queue which is assigned to 
the referenced module. We call systems of this type
Bus-Sufficient (BS) multiprocessor systems. The IN of a BS
system can be implemented by a full crossbar switch
network. If the IN is a blocking network, then a processor
may have to wait for a free bus to access the common
memory even if the referenced module is free at the
time of the request. We call systems of this type
Bus-Deficient (BD) multiprocessor systems.


We are concerned with determining the effects of 
the following two factors on degradation of the system 
performance: 


1) Several processors may simultaneously request
access to the same memory module, or a
referenced module might be busy at the time of a
request, so that some processors remain idle for
several memory cycles. This is called memory con-
tention.

2) If blocking occurs in the IN, then some processors
remain idle for several memory cycles until free
busses become available for access to the common
memory. This is called bus contention.


Figure 1
Basic block structure of a typical multiprocessor system


Skinner and Asher [6] were the first to use Markov 
Chain(MC) models to analyze multiprocessor systems. 
Unfortunately, their method can be applied only to
small-scale systems. To analyze larger-scale multipro-
cessor systems, some approximate and algorithmic
methods have been proposed [1,2,7].


II. ASSUMPTIONS AND PERFORMANCE MEASURES


We have made the following assumptions for the 
mathematical model of multiprocessor systems with 
multiple time-shared busses. 


1) When a processor requests access to the common
memory, a connection is immediately established
between the processor and the referenced module,
provided that the referenced module is not being
accessed by another processor and a bus is avail-
able for the connection.

2) A processor cannot issue another memory request if
its present request has not been granted yet.


3) The duration between the completion of a request 
and the issue of the next one to the common 
memory is an independent, exponentially distri-
buted random variable with the same mean
value, $1/\lambda$, for all processors.


4) The duration of an access by a processor to the 
common memory is an independent, exponentially
distributed random variable with the same mean
value, $1/\mu$, for all memory modules.


5) The request for access from a processor to the 
common memory is uniformly distributed for all 
memory modules with probability 1/m. 


If a queueing model satisfies the assumption (5), 
then it is called a Uniform Reference Model(URM). We 
use the traditional Markovian queueing network theory 
[4] approach for analyzing the multiprocessor systems 
with the assumptions stated above. 


To overcome the computational complexity of the 
exact queueing model for the performance analysis of 
multiprocessor systems with multiple time-shared 
busses, several approximate models have been proposed 
by some researchers [3,5]. However, we have obtained 
exact results by devising an algorithmic method to con- 
vert the MC of the queueing model to a simple birth- 
death process, which is equivalent to the original MC. 


The goal of the analysis of the queueing network 
model is to derive values of some performance meas- 
ures of multiprocessor systems. The expected value of 
percentage of active processors is known as Processing 
Efficiency (PE) of a multiprocessor system. Let $PE^{1}$ and
$PE^{x}$ be the processing efficiencies of two different systems
that have the same parameters as the original
system, except that the former is a unibus system and the
latter is a crossbar system. It is clear that
$PE^{1} \le PE \le PE^{x}$. In fact, $PE^{1}$ and $PE^{x}$ are the lower and
upper bounds for the processing efficiencies of a family
of multiprocessor systems such that the number of
busses in the IN is the design parameter for this family.
We denote the bus effect factor by $\varepsilon$ and define it as fol-
lows: $\varepsilon = (PE^{x} - PE)/PE^{x}$.


III. EVALUATION


The configuration of a multiple bus multiprocessor
system is usually denoted by a 3-tuple (p×m×b), where
p, m, b are the number of processors, the number of memory
modules, and the number of busses, respectively. If
b=min(p,m) then the system is a BS system because a 
bus is available whenever a processor requests access to 
a free memory module. If b<min(p,m) then the system 
is a BD system because a processor may have to wait for 
a bus to access the referenced memory module even 
though the module may be free at the time of the 
request. If all the busses are occupied when a processor 
requests access to a free memory module, then the con- 
troller unit puts the processor in a wait state until a bus 
is available. 
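
The BS/BD distinction above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_system` is hypothetical, and treating any b at or above min(p,m) as bus-sufficient is an assumption (the text states the BS condition as b=min(p,m)).

```python
def classify_system(p: int, m: int, b: int) -> str:
    """Classify a p x m x b configuration as bus-sufficient or bus-deficient.

    Assumption: having more busses than min(p, m) cannot cause blocking,
    so b >= min(p, m) is treated as BS.
    """
    if min(p, m, b) < 1:
        raise ValueError("p, m and b must all be positive")
    return "BS" if b >= min(p, m) else "BD"


if __name__ == "__main__":
    print(classify_system(16, 16, 16))  # crossbar-equivalent configuration
    print(classify_system(16, 16, 8))   # fewer busses than min(p, m)
```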


Each state Q of the continuous-time MC of the
queueing model is represented by an m-tuple
$Q=(k_1,\ldots,k_m)$, where $|k_j|$ indicates the total number of pro-
cessors queueing for memory module j. If $k_j>0$ then the jth
memory module is busy, but if $k_j<0$ then there is no
available bus to access this module. If the total number of
active processors is p-n for a state, then we call it a level
n state. Let U(x) be a binary variable such that U(x)=1
for x<0 and U(x)=0 for x≥0. If $\sum_{j=1}^{m} U(k_j)=t$ for a state,
then we call it a type-t state. A type-0 state is also
called a BS state and a type-t state for t≥1 is called a BD
state. If all states at level n of the MC are BS states,
then we call least one BS state $Q^0=(n,0,\ldots,0)$.
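
As a small illustration of the state encoding, the following Python sketch computes the level and type of a state given as a tuple of signed queue lengths. The function names are hypothetical, and the level is taken to be the total number of queued processors (the sum of the absolute values), which is an assumption that follows from "p-n active processors".

```python
def level(state):
    """Number of processors queued at memory modules (level n of the state)."""
    return sum(abs(k) for k in state)


def state_type(state):
    """Number of modules whose queue is blocked for lack of a bus (k_j < 0)."""
    return sum(1 for k in state if k < 0)


def is_bs_state(state):
    """A type-0 state is a BS state; any other state is a BD state."""
    return state_type(state) == 0


if __name__ == "__main__":
    q = (2, -1, 0)
    print(level(q), state_type(q), is_bs_state(q))
```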


To distinguish the probabilities of states at a given 
level of the MC, we attach weights to the states. The 
weight of a state Q is defined by 


$W[Q] = \Pr(Q)/\Pr(Q^0)$ (1),


where $Q^0$ is a BS state at the same level as Q. Let $\Phi(n)$
denote the set of states at level n of the MC and
$L(n)=\Pr\{Q \mid Q\in\Phi(n)\}$; then by the definition of the weight of a
state we have shown that


$L(n) = \frac{\lambda_{n-1}}{m\mu}\,\frac{\kappa(n)}{\kappa(n-1)}\,L(n-1)$, or
$\hat{\mu}_n L(n) = \hat{\lambda}_{n-1} L(n-1)$, with (2),


$\hat{\lambda}_{n-1} = \lambda_{n-1}/\kappa(n-1)$ and $\hat{\mu}_n = m\mu/\kappa(n)$ for n=1,...,p,


where $\kappa(n) = \sum_{Q\in\Phi(n)} W[Q]$ is defined as the weight of level n
and $\lambda_n=(p-n)\lambda$. The above equations suggest that we
can replace the MC of the queueing model by a simple
birth-death process with parameters $\hat{\lambda}_n$ and $\hat{\mu}_n$ for state
n, which corresponds to level n of the original MC.


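By analyzing the birth-death process and by the defini-
tion of PE, we have obtained


$PE = \left[\sum_{n=0}^{p-1} \kappa(n)\,(p-1)_n \left(\frac{\rho}{m}\right)^{n}\right] \left[\sum_{n=0}^{p} \kappa(n)\,(p)_n \left(\frac{\rho}{m}\right)^{n}\right]^{-1}$ (3),


where $\rho=\lambda/\mu$ (utilization factor for a single-processor
single-memory system) and $(p)_n=p(p-1)\cdots(p-n+1)$ with
$(p)_0=1$.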
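
The birth-death view can be illustrated with a short Python sketch that computes PE directly from the level weights. It rests on several assumptions that are not spelled out in the surviving text: the level probabilities are taken to satisfy L(n)/L(n-1) = (p-n+1)(ρ/m) κ(n)/κ(n-1), PE is taken as the expected fraction (p-n)/p of active processors, and for a crossbar (BS) system every state is given weight 1 so that κ(n) is simply the number of states at level n. All function names are hypothetical.

```python
from math import comb


def crossbar_level_weights(p, m):
    """kappa(n) for a BS system, assuming every state has weight 1.

    The number of ways to place n queued processors on m modules
    is C(n + m - 1, m - 1).
    """
    return [comb(n + m - 1, m - 1) for n in range(p + 1)]


def processing_efficiency(kappa, p, m, rho):
    """Expected fraction of active processors from the level weights.

    kappa must hold kappa(0..p); rho = lambda / mu.
    """
    level_prob = [1.0]  # unnormalised L(n), starting from L(0) = 1
    for n in range(1, p + 1):
        ratio = (p - n + 1) * (rho / m) * kappa[n] / kappa[n - 1]
        level_prob.append(level_prob[-1] * ratio)
    total = sum(level_prob)
    active = sum((p - n) / p * w for n, w in enumerate(level_prob))
    return active / total


def bus_effect_factor(pe, pe_crossbar):
    """Relative loss of PE with respect to the crossbar system."""
    return (pe_crossbar - pe) / pe_crossbar


if __name__ == "__main__":
    p, m, rho = 4, 4, 0.5
    kappa = crossbar_level_weights(p, m)
    print(processing_efficiency(kappa, p, m, rho))
```

A quick sanity check under these assumptions: with one processor and any number of modules there is no contention, and the sketch returns 1/(1+ρ), the fraction of time a single processor spends outside memory accesses.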


We would like to reduce the size of the MC by partitioning
the states into equivalence classes and generating a
lumped MC for the system. If $Q=(k_1,\ldots,k_m)$ is a state of
the MC, then the states $C=\{(k_{j_1},\ldots,k_{j_m}) \mid (j_1,\ldots,j_m)\in P_m\}$
form an equivalence class of Q, where $P_m$ is the set of all
permutations of the integers 1,...,m. Let $\Psi(n)$ denote the set
of equivalence classes at level n of the MC. Since the
weights of the states of an equivalence class all have the
same value, we can define the weight of an equivalence
class as

$W[C] = N[C] \times W[Q]$ with $Q\in C$ (4),


where N[C] is the number of states in C. This equation
implies that $\kappa(n) = \sum_{C\in\Psi(n)} W[C]$. To determine the weights
of the levels of the MC, we devise the following algorithm.


The Algorithm 
1. Initialize n←0 and set the weight of level 0: κ(0)←1
2. n←n+1
3. Ψ(n)←set of equivalence classes at level n; κ(n)←0
4. C←an element of Ψ(n)
5. Q←the representative state of C; /* Q=(k_1,...,k_m) */
   N[C]←the number of elements of C
6. Ψ(n)←Ψ(n)−{C}
7. s←Σ_{j=1}^{m} U(k_j)
8. if s=0 then W[Q]←1
9. else do
a. j←1; W[Q]←0
b. c_j←(|k_j|>1) OR (k_j=1)
c. if c_j=0 then goto 9g
d. |k'_j|←|k_j|−1; sign(k'_j)←sign(k_j)
e. Q'←(k_1,...,k'_j,...,k_m)
f. wlalewial : ! Wa] 
g. j←j+1
h. if j≤m then goto 9b
i. W[Q]←W[Q]/b; end
10. W[C]←N[C] × W[Q]
11. κ(n)←κ(n) + W[C]
12. if Ψ(n)≠∅ then goto 4
13. if n<p then goto 2; end
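
The quantity N[C] used in step 10 is the number of distinct states obtained by permuting the entries of a representative state. A minimal Python sketch of that count follows; the function name `class_size` is hypothetical, and the sketch assumes that two permutations giving the same tuple count as one state.

```python
from collections import Counter
from math import factorial


def class_size(representative):
    """Number of distinct permutations of the representative state.

    This is the multinomial coefficient m! / (c_1! c_2! ...), where the
    c_i are the multiplicities of the distinct entries.
    """
    size = factorial(len(representative))
    for count in Counter(representative).values():
        size //= factorial(count)
    return size


if __name__ == "__main__":
    # Level 2 of a three-module BS system has two classes.
    print(class_size((2, 0, 0)), class_size((1, 1, 0)))
```

For a three-module system at level 2, the two class sizes add up to six, which matches the number of ways to distribute two queued processors over three modules.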


IV. NUMERICAL RESULTS 


We have implemented our algorithm in PLC for the
MTS (Michigan Terminal System) and the program was
run for several p×m×b configurations. We have seen
that execution time increases rapidly with p and by the 
factor of p-b for a fixed value of p. Therefore, we can 
say that the proposed algorithmic method is not very 
suitable for very large-scale multiprocessor systems. 


We can compute the PE of a p×m×b multiprocessor
system by applying equation (3). For p,m,b=
1,2,4,8,12,16 and ρ∈[0,1], PE vs ρ values are depicted in
Fig.2: Fig.2a shows the effect of the number of processors
for m=8 and b=4. Fig.2b shows the effect of the number of
memory modules for p=8 and b=4. Fig.2c shows the
effect of the number of busses for p=m=16 on the system
performance. The numerical values of ε vs ρ are tabu-
lated in Table 1 for the 16×16×b family, where b=16
corresponds to a BS system. We see that decreasing the
number of busses to b=8 makes a difference of only 0.7%
in the PE (ρ=1). Thus, a 16×16×16 system can be replaced
by a 16×16×8 system without any significant degrada-
tion of the system performance.
Table 1
Bus effect factors ε(%) for the 16×16×b family of multiprocessor systems
Abstract


This paper presents a formal model of linear array processors 
suitable for VLSI implementation as well as graph representation 
of programs suitable for execution on such a model. A 
distinction is made between correct mapping and correct
execution of such graphs on this model. A complete 
characterization of the structure of a class of correctly mappable 
graphs is obtained. The formalism developed is used to synthesize 
algorithms for this model. 


1. Introduction 


In [3, 6] specialized array processors were proposed as a means 
of handling compute-bound problems in a cost-effective and 
efficient manner. These array processors generally consist of a 
regular array of simple, identical processing elements which 
operate in synchrony. A host computer drives the array as a 
peripheral. The array can be of many forms, for instance a linear 
array, a rectangular mesh, a hexagonal mesh, etc. Simplicity and 
regularity of these array processors render them suitable for VLSI 
implementation. High performance is achieved by extensive use 
of pipelining and multiprocessing. 


A variety of algorithms have been designed for such arrays [1,
4, 9]. An algorithm executing on such arrays is comprised of
several data streams. A data stream is unidirectional, i.e., it does
not change directions as it passes through processors in the array. 
Elements in distinct data streams move at different velocities 
(processors / cycle) while all elements in a given data stream 
move at the same velocity. Every processor in the array regularly 
receives data from each of the data streams, performs some short 
computation, and pumps the data out. The array communicates 
with the host through certain input/output ports designated as 
external input/output ports and elements in distinct data streams 
are pumped in through distinct external input/output ports. We 
will henceforth refer to such algorithms as "array algorithms".


A few methodologies have been proposed for synthesizing array 
algorithms from program specifications [2, 5, 12]. However in all 
these methodologies the synthesis problem was not studied in a 
formal framework. Also these methodologies shed insufficient 
insight into the synthesis problem for lack of a more intuitive 
representation of programs. 


In this paper we study the synthesis of array algorithms in a 
more rigorous framework using a more intuitive representation of 
programs, namely, data-flow descriptions of programs. In 
particular we will be studying the synthesis of algorithms for a 
linear array. The array is comprised of identical processors, that 
is, they all execute the same set of instructions in every 
instruction cycle, and they are all simple, that is, they do not 
have any addressable local memory and cannot perform 
branching. The linear array is driven either by a single-phase or 
two-phase global clock [7]. In a two-phase clocking scheme the 
two phases are nonoverlapping and adjacent processors are 
activated by the opposite phases of the clock. 
Two reasons motivate our study of such a model. Firstly, this
model has been used for most of the published array algorithms.
Secondly, and more importantly, linear arrays require a fixed I/O
bandwidth. Hence they can be attached as a peripheral to the
I/O bus of any existing host without requiring any change to the
host's I/O bandwidth.


We formalize this linear-array model and then define the 
program graphs that are appropriate for execution on them. A 
program graph is a directed acyclic graph representing a 
computation. The edges represent values and the nodes represent 
computation of a function whose arguments are the values 
represented by the incoming edges. We distinguish between 
correct mapping and correct execution of such program graphs 
on the linear array model. We provide a complete syntactic 
characterization for a class of program graphs (i.e., identify 
structural properties) that are correctly mappable and briefly 
mention the importance of using some semantic knowledge (i.e.,
some property) of the function represented by the nodes in the 
graph to correctly execute the graph. 


This paper is organized as follows - in section 2 and section 3 
we introduce the linear array and program graph models 
respectively. In section 4 we formalize the notion of correct 
execution of program graphs on linear arrays and in section 5 we 
examine the structural properties of correctly mappable program 
graphs. We illustrate the formalisms developed by synthesizing 
some linear-array algorithms. For brevity, the proofs of most of 
the Theorems in this paper have been omitted. They can 
however be found in [10]. 


2. Linear Array Model 


In this section we define the linear array processor that 
formally captures the intuitive linear array described in the 
previous section. A linear array is a 3-tuple $Ar=\langle N, L_{ar}, \Psi_{ar}\rangle$
as follows.


1. N is a sequence of identical processors with indices
ranging from 1 to |N|.

2. $L_{ar}$ = {l1, l2, ..., lk} is a set of labels.

3. Every processor in the array has k input ports and k
output ports, with each input port and output port
assigned a unique label lj from $L_{ar}$. Each processor in
N is connected to its neighbors in the sequence
through its I/O ports. In addition the first and last
processors may have input and output ports
connected to the host environment.

4. The array is driven either by a single-phase or a two-
phase global clock. A phase can be viewed as the
instruction cycle of a processor. In a single-phase
clocking scheme all processors are activated in every
phase and every processor computes a k-ary function
$\Psi_{ar}$. In a two-phase clocking scheme adjacent
processors are activated during opposite phases of the
clock and every processor computes $\Psi_{ar}$ in the phase
it is active.


The function $\Psi_{ar}$ computed by a processor is a straight-line
program. This restriction is imposed since we have assumed that
a processor does not have any branching ability. We will
henceforth refer to a processor in the array by its index in the
sequence N. Let s be the index of a processor. Let
$si_t=\langle si_t^1, si_t^2, \ldots, si_t^k\rangle$ denote the k-tuple input to processor s at
time t, where $si_t^j$ is the value at the input port labelled lj of
processor s at time t. Let $so_t=\langle so_t^1, so_t^2, \ldots, so_t^k\rangle$ denote the k-
tuple output computed by processor s at time t, i.e.,
$\Psi_{ar}(si_t)=so_t$.


For any label lj in $L_{ar}$, let $\rho_{lj}$ be the neighborhood relation
imposed by label lj on processors in N. Let <s,r> be any pair of
processors in N.


Definition 2.1: We shall say that processor s is related to
processor r by label lj, denoted as $s\,\rho_{lj}\,r$, iff the output port
labelled lj of s is connected to the input port labelled lj of r.


We will refer to a path of uniform labels through the array as a 
data stream. The linear array has the following communication 
features. 


1. A processor in the linear array can only communicate
with up to two neighbors. All data streams are
unidirectional. Hence for any label lj in $L_{ar}$, if $\rho_{lj}$ is
not an empty relation, then a neighborhood constant
$n_{lj}$ is associated with lj such that the output port
labelled lj of any processor s is connected to the input
port labelled lj of $s+n_{lj}$, where $n_{lj}$ is one of {1, -1, 0}.

2. The elements in a data stream move at a constant
velocity, and hence a non-zero positive delay constant
$d_{lj}$ is associated with every label lj in $L_{ar}$ such that
for any processor s, if $so_t$ is the output computed by s
at time t then $so_t^j$ appears at the input port labelled lj
of processor $s+n_{lj}$ at $t+d_{lj}$.

3. External communication takes place through certain
designated input/output ports, namely,

a. if $\rho_{lj}$ is empty then the input port and output
port labelled lj of every processor communicate
with the host,

b. if $n_{lj}=1$ then the input port labelled lj of
processor 1 and the output port labelled lj of
processor |N| communicate with the host,

c. if $n_{lj}=-1$ then the input port labelled lj of
processor |N| and the output port labelled lj of
processor 1 communicate with the host,

d. if $n_{lj}=0$ then a register in every processor
serves as the input/output port labelled lj. No
input/output port labelled lj communicates
with the host. A value is preloaded into this
register before starting the computation and the
result value (the preloaded value may be
updated as computation progresses) is retrieved
from this register after the computation
terminates.

We will call the input/output ports that communicate with the 
host external input/output ports. 
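
The communication features above can be sketched as two small Python helpers. This is an illustrative sketch only: the function names are hypothetical, processors are numbered 1 to |N| as in the model, and the case of an empty neighborhood relation (feature 3a, where every processor talks to the host) is deliberately not covered.

```python
def arrival(s, t, n_lj, d_lj):
    """Where and when an output on label lj reappears as an input.

    An output computed by processor s at time t shows up at the input
    port of processor s + n_lj at time t + d_lj.
    """
    if n_lj not in (1, -1, 0):
        raise ValueError("neighborhood constant must be 1, -1 or 0")
    if d_lj < 1:
        raise ValueError("delay constant must be a positive integer")
    return (s + n_lj, t + d_lj)


def external_ports(n_lj, num_processors):
    """Return (input processor, output processor) for a non-empty relation.

    Returns None when n_lj = 0, since the port is then a local register
    that does not communicate with the host.
    """
    if n_lj == 1:
        return (1, num_processors)
    if n_lj == -1:
        return (num_processors, 1)
    if n_lj == 0:
        return None
    raise ValueError("neighborhood constant must be 1, -1 or 0")


if __name__ == "__main__":
    print(arrival(3, 5, 1, 2))
    print(external_ports(-1, 4))
```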


The delay $d_{lj}$ can be implemented as a queue using a shift
register of length $d_{lj}-1$ if single-phase clocking is used and of
length $(d_{lj}-1)/2$ if two-phase clocking is used. At any time t, then,
an activated processor s in the array performs the following
sequence of operations:


1. Compute $\Psi_{ar}(si_t)=so_t$, where $si_t=\langle si_t^1, si_t^2, \ldots, si_t^k\rangle$
and $so_t=\langle so_t^1, so_t^2, \ldots, so_t^k\rangle$.

2. For every label lj, dequeue the element at the head of
the queue associated with lj and place it at the
output port labelled lj of s.

3. For every label lj, place $so_t^j$ at the tail of the queue.
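
The queue-based delay can be illustrated with a small single-phase simulation in Python. It is a sketch under stated assumptions: the function name `simulate_delay` is hypothetical, only one label and one processor are modelled, the neighbor is assumed to read the output port at the start of each cycle, and the new output is enqueued before the head is dequeued so that a delay of 1 (an empty shift register) needs no special case.

```python
from collections import deque


def simulate_delay(outputs, d):
    """Return what the neighbor sees on its input port at times 0, 1, 2, ...

    outputs[t] is the value computed by the sending processor at time t.
    A shift register of length d - 1 sits between the processor and its
    output port, giving a total delay of d cycles.
    """
    if d < 1:
        raise ValueError("delay must be a positive integer")
    queue = deque([None] * (d - 1))  # shift register, initially empty slots
    port = None                      # value currently at the output port
    seen = []
    for out in outputs:
        seen.append(port)            # neighbor reads the port this cycle
        queue.append(out)            # new output joins the tail
        port = queue.popleft()       # head of the queue moves to the port
    return seen


if __name__ == "__main__":
    print(simulate_delay([10, 20, 30, 40], 2))
```

In this sketch a value produced at time t is first seen by the neighbor at time t+d, which is the behaviour the delay constant is meant to capture.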


Figure 2.1 illustrates a linear array with $n_{l1}=1$, $n_{l2}=-1$, $n_{l3}=0$.
The neighborhood relation $\rho_{l4}$ imposed by label l4 is empty.
$I_{l1}/O_{l1}$, $I_{l2}/O_{l2}$, $I_{l3}/O_{l3}$ and $I_{l4}/O_{l4}$ are the external input/output
ports associated with labels l1, l2, l3 and l4 respectively.

Figure 2-1


Henceforth, "linear array (arrays)" used in the rest of this 
paper will refer to the model defined above. 


3. Homogeneous Graphs 


The linear array is comprised of identical processors all of 
which compute the same function (or execute the same 
instruction) in every cycle. All the processors in the array
cooperate in executing a single program. As all the processors in 
the array are identical, the straight-line programs they execute 
must also be identical. This motivates the following 
formalization of programs appropriate for execution on linear 
arrays. 


A homogeneous program graph $G=\langle V,E,L_G\rangle$ is a labelled
DAG where:


1. $V=V_G\cup SO_G\cup SI_G$, where $V_G$, $SO_G$ and $SI_G$ are three
disjoint sets of vertices with $SO_G$ the set of source
vertices, $SI_G$ the set of sink vertices and $V_G$ the set of
remaining vertices, which we shall call computation
vertices,

2. $L_G$ is a set of labels. Let $|L_G|=k$, and

3. every vertex in $V_G$ has k incoming edges and k
outgoing edges, where each incoming and outgoing
edge is assigned a unique label from $L_G$.


Input edges and output edges in G are those edges that are 
directed out of and into source and sink vertices respectively. 


In any execution of G on a linear array, every computation
vertex in G is a single instance of a function evaluation that is
performed in a cycle by a processor in the array. Hence the
function represented by $v_x$ must be a straight-line program,
and we can view the k incoming edges and the k outgoing edges
of a vertex $v_x$ as representing the k-tuple input value and k-tuple
output value computed by the processor that evaluates $v_x$. A
source vertex, then, represents an input value and a sink vertex
represents an output value. As every computation vertex
represents the same function, we refer to these program graphs
as Homogeneous Graphs.


Figure 3.1 illustrates a homogeneous graph. The solid and 
dashed horizontal edges are labelled l1 and l2 respectively. The
vertical and oblique edges are labelled l3 and l4 respectively.


Figure 3-1 


In Figure 3.1 and in all the other graphs illustrated in this paper 
we will be using ’e’ to represent computation vertices and ’x’ to 
denote source and sink vertices. 


Although homogeneous graphs are a more limited class of
program graphs than, for instance, general dataflow graphs, they
do allow the representation of quite a number of interesting
programs which are potentially suitable for execution on the 
linear array model. As we shall see, not even all homogeneous 
graph programs can be executed on the simple computing engines 
we have defined. 


Henceforth we will assume the following: 


1. G is a homogeneous graph. 

2. The label of a source (sink) vertex is the same as that 
of the input (output) edge directed out of the source 
(directed into the sink) vertex. 

3. Input (output) value will always refer to the value 
represented by a source (sink) vertex. 


4. Mapping Homogeneous Graphs 


We now give a precise formulation of correct mapping and 
correct execution of homogeneous graphs on linear arrays. 
Intuitively mapping of G onto a linear array Ar assigns each 
computation vertex of G to a processor in Ar at a particular time 
step and also fixes the delay and neighborhood constant for every 
label in $L_G$. Assuming discrete time steps, let T={0,1,2,...} be
the sequence of natural numbers representing the progress of
computation from its start at time 0.


Definition 4.1: A mapping of G onto a linear array Ar is a
4-tuple <PA,TA,NA,DA> where:

1. PA and TA are many-one functions mapping computation
vertices onto processors and time steps respectively.

2. Let $I^+$ be a set of positive non-zero integers.
$NA: L_G \to \{1,-1,0\}$ and $DA: L_G \to I^+$ are many-one
functions assigning neighborhood constants and
delays to labels respectively.

[Note: $NA(lj)=n_{lj}$ and $DA(lj)=d_{lj}$.]


We next formalize a correct mapping.


Definition 4.2: A mapping is syntactically correct iff


1. $\forall lj\in L_G$ and for any pair of computation vertices $v_x$
and $v_y$, if there is an edge labelled lj directed from $v_x$
to $v_y$ then $PA(v_y)=PA(v_x)+n_{lj}$ and
$TA(v_y)=TA(v_x)+d_{lj}$, and

2. no two input/output values can appear
simultaneously at the same input port of a processor.
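
Condition 1 of this definition lends itself to a direct check. The Python sketch below is illustrative only: the function name and the dictionary-based encoding of PA, TA, NA and DA are assumptions, edges are given as (source vertex, target vertex, label) triples between computation vertices, and condition 2 (no port collisions) is not checked.

```python
def satisfies_edge_condition(edges, PA, TA, NA, DA):
    """Check condition 1 of syntactic correctness for every labelled edge.

    For an edge labelled lj from vx to vy, the target must be placed
    NA[lj] processors away and DA[lj] time steps later than the source.
    """
    return all(
        PA[vy] == PA[vx] + NA[lj] and TA[vy] == TA[vx] + DA[lj]
        for vx, vy, lj in edges
    )


if __name__ == "__main__":
    edges = [("a", "b", "l1"), ("b", "c", "l1")]
    PA = {"a": 1, "b": 2, "c": 3}
    TA = {"a": 0, "b": 1, "c": 2}
    print(satisfies_edge_condition(edges, PA, TA, {"l1": 1}, {"l1": 1}))
```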


Let $\iota$ be the input value represented by the source vertex of a
computation vertex, say, $v_x$. Similarly, let $o$ be the output value
represented by the sink vertex of another computation vertex,
say, $v_y$. Without loss of generality, let the labels of the source
and sink vertices be lj. Now $\iota$ is fed into the array and $o$ is
retrieved from the array through the external input port and
external output port respectively associated with label lj. Let
$TA(v_x)=t_x$ and $TA(v_y)=t_y$.


Definition 4.3: Entry Time for $\iota$ and Exit Time for $o$ is the
time at which $\iota$ is fed into and $o$ is retrieved from the array
respectively. Consumption Time of $\iota$ and Production Time of $o$ is
$t_x$ and $t_y+d_{lj}$ respectively.


We are now in a position to introduce the notion of correct
execution of homogeneous graphs. 


Definition 4.4: G is correctly executed on a linear array iff 


1. the mapping is syntactically correct, and 

2.for every input value its value at entry and 
consumption times must be the same and for every 
output value its value at production and exit times 
must be the same. 


Intuitively, condition 2 means that we may be required to
keep a value input to (output by) the array constant as
it passes through some number of processors in order that it
arrive unchanged at a processor (external output port) that will
use it (from which it will be retrieved).


5. Syntactic Characterization 


Our aim is to identify the structure of homogeneous graphs for 
which there exist syntactically correct mappings. We begin by 
identifying the relevant structural elements of a homogeneous 
graph G. 


Definition 5.1: For any label lj in G, a major path labelled lj
is a directed path from a source vertex $v_x$ to a sink vertex $v_y$
such that the label of $v_x$, $v_y$ and all the edges in the path is lj.
The path label of a major path is the label of the edges in the
path.


Definition 5.2: Two major paths are identical iff, ignoring the 
source and sink vertices in them, the two directed paths are the 
same. 


For any label lj, let $E_{lj}$ = {major paths having the same path
label lj}. Not every $E_{lj}$ is relevant for a syntactic
characterization of homogeneous graphs. Consequently, we divide
the labels of G into three classes:


1. $L_1$={lj | there exists a pair of computation vertices
$v_x$ and $v_y$ and a directed edge $e=\langle v_x,v_y\rangle$ whose
label is lj. Besides, for any li and lj in $L_1$ there exists
a major path in $E_{lj}$ that is not identical to any major
path in $E_{li}$.} The major paths with these labels are
relevant for structural characterization of correctly
mappable graphs.

2. $L_2$={lj | there exists a pair of computation
vertices and a directed edge $e=\langle v_x,v_y\rangle$ whose
label is lj. Besides, if lj is in $L_2$ then there exists an li
in $L_1$ such that for every major path in $E_{lj}$ there is an
identical major path in $E_{li}$.} Given the major paths
associated with the labels in $L_1$, the major paths
associated with those in this class are redundant for
structural characterization.

3. $L_3$={lj | there exists no pair of computation vertices
$v_x$ and $v_y$ such that there is a directed edge
$e=\langle v_x,v_y\rangle$ whose label is lj}.


Consider the homogeneous graph in Figure 3.1 again. The solid 
and dashed horizontal edges are labelled l1 and l2 respectively.
The vertical and oblique edges are labelled l3 and l4 respectively.


$L_1$={l1, l3}, $L_2$={l2} and $L_3$={l4}.


Henceforth, throughout the rest of this paper, labels will be
assumed to be in $L_1$ unless explicitly mentioned otherwise.


We are now in a position to define the class $\Omega$ of program
graphs that we will be examining in this paper. If there exists a
connected subgraph SG in G whose label set $L_{SG}=\{lu,lv\}\subseteq L_1$
and whose vertex set $V_{SG}$ contains $V_G$, i.e., $V_G\subseteq V_{SG}$, then G is
in $\Omega$. Existence of SG signifies that there is an undirected path
between any pair of computation vertices in G through edges
that are labelled either lu or lv. $\Omega$ is a large class that includes
homogeneous program graphs for important computational
problems like sorting, convolution, vector multiplication of band
matrices, pattern matching, priority queue, etc. lu and lv will
refer to the two labels of SG.


The structure imposed on SG by any correct mapping is 
elegantly formalized below. 


Definition 5.3: Let $I_u$ and $I_v$ be two sequences of integers
ranging from 0 to $h_u$ and from 0 to $h_v$ respectively, and let
B ⊆ $I_u \times I_v$. Then SG is a Mesh Graph iff there exists a
one-one function F from the computation vertices of SG into B such
that the following property holds. Let $F_u$ and $F_v$ be the
projection functions of F, i.e., for any computation vertex $v_x$,
if F($v_x$) = <m,n> then $F_u(v_x)$ = m and $F_v(v_x)$ = n. For any
computation vertices $v_x$ and $v_y$, there exists a directed path
from $v_x$ to $v_y$ in a major path whose path label is lu such that
the distance from $v_x$ to $v_y$ in this directed path is d iff
$F_u(v_y) = F_u(v_x) + d$ and $F_v(v_y) = F_v(v_x)$. A similar
condition holds for a major path whose path label is lv.


Henceforth we will denote $F_u(v_i)$ and $F_v(v_i)$ as $x_i$ and $y_i$
respectively.


Figure 5.1 is an example of a Mesh Graph wherein the
horizontal and vertical major paths are labelled lu and lv
respectively.


Figure 5.1

We next relate the structure of SG to the existence of a
syntactically correct mapping in the following Theorem.


Theorem 5.1: If there exists a syntactically correct 
mapping for G then SG must be a Mesh Graph. 


When G is finally mapped onto a linear array the computation 
vertices in G may be partitioned into sets that comprise vertices 
which are mapped onto the same physical processor. As we will 
see later on this is useful in expressing the structure of correctly 
mappable graphs in a simple way. To formalize this partitioning 
it is useful to define a Diagonalization of the Mesh Graph SG as 
follows. 


Definition 5.4: Let w = <$w_1$,$w_2$> ∈ {<1,1>, <1,-1>,
<1,0>, <0,1>}. A Diagonalization of SG is a pair <D,w>
with the following properties.

1. D = {$D_1$, $D_2$, ..., $D_n$} is a family of ordered sets of
computation vertices and $D_1 \cup D_2 \cup \dots \cup D_n$ is the
set of all computation vertices.

2. For any $D_p$ in D, if $v_i$ and $v_j$ are in $D_p$ then
$w_1 x_i + w_2 y_i = w_1 x_j + w_2 y_j$.

3. Let $T_D$ denote the indexing function associated with
the ordered set D. For any pair $D_p$ and $D_q$ in D, if
$v_i$ and $v_j$ are in $D_p$ and $D_q$ respectively then
$T_D(D_p) < T_D(D_q)$ iff $w_1 x_i + w_2 y_i < w_1 x_j + w_2 y_j$.

Henceforth, we will refer to D as the set of Main Diagonals and
to w as the Main Diagonalization Factor. We will assume that
the indices assigned to the diagonals in D range from 1 to |D| and
if $D_p$ is a diagonal in D then $T_D(D_p)$ = p, i.e., the index of
$D_p$ in the ordering is p. We use the ordering of the diagonals in
D to define an adjacency relation imposed on them by labelled edges.

Definition 5.5: Let $D_p$ and $D_q$ be in D. $D_p$ $a_{lj}$ $D_q$
(read '$D_p$ is related by $a_{lj}$ to $D_q$') iff there exists a
computation vertex $v_x$ in $D_p$ and another computation vertex
$v_y$ in $D_q$ and a directed edge e = <$v_x$,$v_y$> whose label is lj.


Definition 5.6: $a_{lj}$ is consistent with respect to $T_D$ iff there
exists a constant $m_{lj}$ such that for all $D_p$ ∈ D and $D_q$ ∈ D,
if $D_p$ $a_{lj}$ $D_q$ then $T_D(D_q) - T_D(D_p) = m_{lj}$.


We will call $m_{lj}$ the consistency constant of $a_{lj}$. Let
$S_D$ = { $a_{lj}$ | lj ∈ $L_1$ and $a_{lj}$ is the adjacency relation
on D imposed by edges labelled lj }.


It is useful to define the set Dc of Complementary Diagonals
that is obtained by diagonalizing SG by its Complementary
Diagonalization Factor $w_c$, where
$w_c$ ∈ {<1,1>, <1,-1>, <1,0>, <0,1>}.
Let $T_{Dc}$ denote the indexing function associated with Dc and let
$S_{Dc}$ = { $b_{lj}$ | lj ∈ $L_1$ and $b_{lj}$ is the adjacency relation
on Dc imposed by edges labelled lj }. Herein also we will assume that
the index of the complementary diagonals in Dc ranges from 1 to |Dc|
and if $Dc_p$ is a complementary diagonal in Dc then its index is
p. Consistency of $b_{lj}$ with respect to $T_{Dc}$ is defined
similarly to that of $a_{lj}$. Let $c_{lj}$ denote the consistency
constant of $b_{lj}$.


Consider Figure 5.1 again. Let w=<1,-1> and $w_c$=<0,1>.
Then the set of main diagonals D={$D_1$, $D_2$, $D_3$, $D_4$} is comprised
of four diagonals where D,={v,}, D,={v,}, D,={v,, vy} and 
D,={v.5, v5}. The set of complementary diagonals De={Dc,, 
Dc,, Deg} is comprised of three diagonals where Dc,={v,, vo}, 
De,={v5, V4, V5} and De, ={ve}. 


Let $v_x$ and $v_y$ be two vertices in the main diagonals $D_p$ and $D_q$
respectively and complementary diagonals $Dc_s$ and $Dc_r$
respectively. Then we will denote the difference in indices of $D_p$
and $D_q$, which is q-p, as $\Delta_D(v_x,v_y)$. We will also denote the
difference in indices of $Dc_s$ and $Dc_r$, which is r-s, as
$\Delta_{Dc}(v_x,v_y)$.


We next define two classes of graphs $\Theta_1$ ⊂ Θ and $\Theta_2$ ⊂ Θ where:


$\Theta_1$ = {G ∈ Θ | SG is a mesh graph and the main diagonalization
factor w of SG is one of {<1,-1>, <0,1>, <1,0>}} and


$\Theta_2$ = {G ∈ Θ | SG is a mesh graph and the main diagonalization
factor w of SG is <1,1>}.


We provide a complete syntactic characterization of program
graphs in $\Theta_1$ which have syntactically correct mappings in the
following Theorem. Before doing so we introduce the notion of
transitive edges which is needed in the proof sketch of the
Theorem.


Definition 5.7: Let e = <$v_x$,$v_y$> be a directed edge from
vertex $v_x$ to vertex $v_y$. Then e is a transitive edge iff there
exists a vertex $v_z$ and edges $e_1$ = <$v_x$,$v_z$> and
$e_2$ = <$v_z$,$v_y$>.


Theorem 5.2: Let G ∈ $\Theta_1$. There exists a syntactically correct
mapping for G if and only if there exists a pair <D,Dc> such
that each of the following conditions is satisfied:


1. Every relation $a_{lj}$ ∈ $S_D$ is consistent with respect to $T_D$
and its consistency constant $m_{lj}$ is one of {1,-1,0}.

2. Every relation $b_{lj}$ ∈ $S_{Dc}$ is consistent with respect to
$T_{Dc}$.

3. Let $v_x$ and $v_y$ be any two computation vertices. For
any label lj, if $c_{lj} \Delta_D(v_x,v_y) = m_{lj} \Delta_{Dc}(v_x,v_y)$
then there must be a major path labelled lj passing through $v_x$
and $v_y$.


Intuitively, condition (1) ensures that a data stream is 
unidirectional and communication takes place only between 
adjacent processors while condition (2) ensures that a data 
stream moves at constant velocity and condition (3) ensures that 
no two values appear simultaneously at the input port of any 
processor. 


We sketch only the sufficiency proof. The proof is constructive 
and we will be using this constructive procedure to illustrate 
synthesis of linear-array algorithms later on. 


Proof: (Only If): See [10] for details.


(If Part): Let D = {$D_1$, $D_2$, ..., $D_n$} be the set of main diagonals
where i denotes the index of any $D_i$ ∈ D. Construct a linear array
whose set of processors N satisfies |N| = n. Now construct a mapping
through the following steps.


1. Choose two-phase clocking if there exists a transitive
edge labelled lj such that $m_{lj}$=0, or else choose a
single-phase clocking scheme.

2. Let $D_q$ be any diagonal in D and let $v_x$ be any
computation vertex in $D_q$. Then, let PA($v_x$)=q. This
assigns computation vertices to processors.

3. Next fix the neighborhood constant $n_{lj}$ and delay
constant $d_{lj}$ for every label lj in $L_1$. Let $n_{lj} = m_{lj}$. Let
$d_1$ and $d_2$ be two constants which we will be using in
the construction of the delays for the labels in $L_1$. If
the main diagonalization factor w is <1,-1> or there
exists a transitive edge labelled lj such that $m_{lj}$=0
then let $d_1$=2 else let $d_1$=1. Let $c_{min}$ be the
minimum of all consistency constants among all the
relations in $S_{Dc}$. If $c_{min}$>0 then set $d_2$=1 else set
$d_2 = 1 + |c_{min}| d_1$. Let $d_{lj} = m_{lj} d_2 + c_{lj} d_1$.

4. Next construct the neighborhood and delay constants
for the labels in $L_2$. By definition of $L_2$, if there
exists a label lj in $L_2$ then there must exist some label
li in $L_1$ such that for every major path in $E_{lj}$ there is
an identical major path in $E_{li}$. Hence let $n_{lj} = n_{li}$ and
$d_{lj} = d_{li}$.

5. For every lj in $L_3$, let the neighborhood relation
imposed by label lj on processors in N be empty and
hence no processor's output port labelled lj is
connected to the input port labelled lj of any
processor.

6. Construct the function TA which assigns computation
vertices to time steps. Let $v_x$ be the computation
vertex which is in $D_1$ ∈ D and $Dc_1$ ∈ Dc. Let
TA($v_x$) = $t_x$. Let $v_y$ be any computation vertex in
$D_q$ ∈ D and $Dc_p$ ∈ Dc. Then, let
$TA(v_y) = t_x + (q-1) d_2 + (p-1) d_1$.


Steps 1 to 6 described above complete the construction of
a correct mapping. Refer to [10] to verify that the mapping is
correct.


□
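The delay construction in steps 3 and 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `delays` and `time_assignment` are hypothetical, the delay of a label lj is read as $m_{lj} d_2 + c_{lj} d_1$, and $c_{min}$ is read as the minimum over the constants of the complementary relations. That reading is chosen because it reproduces the delays quoted in Examples 5.1 and 5.2; the source text is too damaged to confirm it directly.

```python
def delays(w, m, c, transitive_zero_labels=()):
    """Return (d1, d2, per-label delays).

    m and c map each label to its consistency constant with respect to
    the main and complementary diagonals (assumed inputs).
    transitive_zero_labels lists labels lj that have a transitive edge
    and m[lj] == 0.
    """
    d1 = 2 if w == (1, -1) or transitive_zero_labels else 1
    c_min = min(c.values())
    d2 = 1 if c_min > 0 else 1 + abs(c_min) * d1
    per_label = {label: m[label] * d2 + c[label] * d1 for label in m}
    return d1, d2, per_label


def time_assignment(t_base, q, p, d1, d2):
    """Time step of a vertex on main diagonal q and complementary diagonal p."""
    return t_base + (q - 1) * d2 + (p - 1) * d1


# Constants of Example 5.1 (band matrix times vector).
example_5_1 = delays((1, -1), {"l1": 1, "l2": -1}, {"l1": 0, "l2": 1})
# Constants of Example 5.2 (convolution); l2 has transitive edges and m = 0.
example_5_2 = delays((1, 0), {"l1": 1, "l2": 0, "l3": -1},
                     {"l1": 0, "l2": 1, "l3": 1}, ("l2",))
print(example_5_1)
print(example_5_2)
```

With these inputs the sketch yields unit delays for l1 and l2 in the first case and delays 1, 2 and 1 for l1, l2 and l3 in the second, matching the values stated in the two examples.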


The three conditions of Theorem 5.2 are necessary but not
sufficient for the existence of syntactically correct mappings for
graphs in $\Theta_2$. However, in the next corollary we show that in
certain cases they are both necessary and sufficient. Let G ∈ $\Theta_2$
and let C = {$c_{lj}$} - {$c_{lu}$, $c_{lv}$}.

Corollary 5.1: If $c_{lj}$ > 0 for every $c_{lj}$ ∈ C, or $c_{lj}$ < 0 for
every $c_{lj}$ ∈ C, then there exists a syntactically correct mapping
for G if and only if the three conditions in Theorem 5.2 are satisfied.


Proof: Similar to Theorem 5.2 except in the construction of 
the expressions for the delays. If cj; >0 then set d,=2, d,=1, 


di = and d,=3. If C4;<0 then set d,=-2, d,=3, d),=3 and 
d;,=1. In [10] it is shown that this construction yields d,;>0. 
0 


The sufficiency proof of Theorem 5.2 provides a methodology
to synthesize linear-array algorithms for graphs in Θ. The
construction used in the Theorem maps a program graph
correctly. However, very often, to ensure its correct execution we
need to use some property of the function represented by the
computation vertices in the graph. The structure of graphs that
can be executed without using such knowledge is characterized in
[10].


We now apply the results described in this paper to synthesize 
linear-array algorithms for computing the vector multiplication 
of band matrices and convolution. 


Example 5.1: Consider multiplication of a Band Matrix A by
a Vector X as shown in Figure 5.2. Y is the result vector. The
computation of this multiplication can be represented by the
program graph in Figure 5.3 wherein $V_{ij}$ denotes a computation
vertex. The horizontal, vertical and oblique edges are labelled l1,
l2 and l3 respectively. Let Ψ denote the function represented by
any computation vertex in the graph. Ψ is a 3-ary function such
that for any a, b and c, Ψ<a,b,c> = <a+bc,b,c>. Let $\Psi_1$, $\Psi_2$, $\Psi_3$
be the three projections of Ψ, i.e., $\Psi_1$<a,b,c> = a+bc,
$\Psi_2$<a,b,c> = b and $\Psi_3$<a,b,c> = c. If a, b and c are the input
values represented by the horizontal, vertical and oblique input
edges of $V_{ij}$ then the output values represented by the outgoing
horizontal, vertical and oblique edges of $V_{ij}$ are $\Psi_1$<a,b,c>,
$\Psi_2$<a,b,c> and $\Psi_3$<a,b,c> respectively. The input value
represented by every horizontal source vertex is initialized to 0.
Let $E_{l1}$={horizontal major paths}, $E_{l2}$={vertical major paths}
and $E_{l3}$={oblique major paths}. It can be seen that $L_1$={l1,l2},
$L_2$=∅ and $L_3$={l3}.

Let SG be the connected subgraph shown in Figure 5.4 that is
obtained by removing all the edges that are labelled l3. For
purposes of clarity SG has been drawn without the source and
sink vertices. It can be easily verified that the program graph in
Figure 5.3 is in Θ. Now diagonalize SG with w=<1,-1> to
form the set of main diagonals D. It can be verified that
D={$D_1$,$D_2$,$D_3$,$D_4$} is comprised of four diagonals where
$D_1$={$V_{31}$,$V_{42}$,$V_{53}$,$V_{64}$}, $D_2$={$V_{21}$,$V_{32}$,$V_{43}$,$V_{54}$,$V_{65}$},
$D_3$={$V_{11}$,$V_{22}$,$V_{33}$,$V_{44}$,$V_{55}$} and $D_4$={$V_{12}$,$V_{23}$,$V_{34}$,$V_{45}$}.


Next diagonalize SG with $w_c$=<0,1> to form the set Dc of
complementary diagonals. It can be verified that
Dc={$Dc_1$,$Dc_2$,$Dc_3$,$Dc_4$,$Dc_5$,$Dc_6$} is comprised of six diagonals
where $Dc_1$={$V_{11}$,$V_{12}$}, $Dc_2$={$V_{21}$,$V_{22}$,$V_{23}$},
$Dc_3$={$V_{31}$,$V_{32}$,$V_{33}$,$V_{34}$}, $Dc_4$={$V_{42}$,$V_{43}$,$V_{44}$,$V_{45}$},
$Dc_5$={$V_{53}$,$V_{54}$,$V_{55}$} and $Dc_6$={$V_{64}$,$V_{65}$}.


In Figure 5.4 all the computation vertices belonging to the 
same diagonal in D lie on the same dashed line. Similarly all the 
computation vertices belonging to the same diagonal in Dc lie on 
one horizontal major path. 
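The two diagonalizations of this example can be checked mechanically. The sketch below is an illustration only: it assumes that the mesh coordinate x of $V_{ij}$ is the column index j and y is the row index i, and that the band is given by -1 <= i - j <= 2 inside a 6 x 5 matrix (both inferred from the vertex lists above, not stated by the paper). The helper name `diagonalize` is hypothetical.

```python
from collections import defaultdict

# Computation vertices V_ij as (i, j) pairs; the band condition is an
# assumption inferred from the diagonals listed in the text.
VERTICES = [(i, j) for i in range(1, 7) for j in range(1, 6)
            if -1 <= i - j <= 2]


def diagonalize(vertices, w):
    """Group vertices by w1*x + w2*y and order the groups by that value.

    Assumes x is the column index j and y is the row index i.
    """
    w1, w2 = w
    groups = defaultdict(list)
    for i, j in vertices:
        groups[w1 * j + w2 * i].append((i, j))
    return [sorted(groups[key]) for key in sorted(groups)]


main = diagonalize(VERTICES, (1, -1))          # main diagonals D_1..D_4
complementary = diagonalize(VERTICES, (0, 1))  # complementary Dc_1..Dc_6

for index, diagonal in enumerate(main, start=1):
    print(f"D_{index} =", diagonal)
```

Under these assumptions the grouping gives four main diagonals and six complementary diagonals, with the first main diagonal holding the vertices (3,1), (4,2), (5,3) and (6,4).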


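Now $S_D$={$a_{l1}$,$a_{l2}$}, $S_{Dc}$={$b_{l1}$,$b_{l2}$} and $m_{l1}$=1, $m_{l2}$=-1, $c_{l1}$=0
and $c_{l2}$=1. It can be seen that this graph satisfies Theorem 5.2.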


Next, using the construction in Theorem 5.2 we synthesize the
linear-array algorithm in [4]. |D|=4 and hence the linear array
has 4 processors indexed from 1 to 4. $m_{l1}$≠0 and $m_{l2}$≠0 and
hence use single-phase clocking. Each processor is comprised of 3
pairs of input/output ports labelled l1, l2 and l3 respectively.
The neighborhood relation imposed by l3 is empty.


Let $si_t^1$, $si_t^2$ and $si_t^3$ denote the inputs at the input ports
labelled l1, l2 and l3 respectively of processor s at time t and let
$so_t^1$, $so_t^2$ and $so_t^3$ denote the outputs computed by s at time t.
Then $so_t^1 = si_t^1 + si_t^2 \times si_t^3$, $so_t^2 = si_t^2$ and
$so_t^3 = si_t^3$.


The computation vertices in $D_1$, $D_2$, $D_3$ and $D_4$ are mapped onto
processors 1, 2, 3 and 4 respectively. From the construction of
Theorem 5.2, we obtain $n_{l1}$=1, $n_{l2}$=-1, $d_{l1}$=1 and $d_{l2}$=1. The
resulting mapped graph is shown in Figure 5.5. The time at
which a computation vertex is mapped is indicated by the side of
the vertex in Figure 5.5. For instance, the computation vertex on


D, and Dc, is mapped at time t+2. For correctness of execution 
we must ensure the invariance of the two input values $ih_1$ and $ih_2$
at their consumption and entry times and the invariance of the
two output values $oh_5$ and $oh_6$ at their exit and production
times. The consumption times for $ih_1$ and $ih_2$ are t and t+1
respectively. Table 5.1 gives the times at which $ih_1$ appears at
the input port labelled l1 of processors 1 and 2 and $ih_2$ appears at
the input port labelled l1 of processor 1. Any element pumped
into a port labelled l1 or l2 travels at the rate of 1 processor/cycle
as 1/$d_{l1}$ = 1/$d_{l2}$ = 1. Consider some row of Table 5.1, say 2. The
entry in column 1 indicates that $ih_2$ appears at the input port
labelled l1 of processor 1 at time t. Now $\Psi_1$ is such that for any
b, $\Psi_1$<a,b,0> = a+b0 = a and hence by pumping 0 into the input
port labelled l3 of processor 1 at t, invariance of $ih_2$ at its entry
and consumption times can be maintained. Similarly, by pumping
0 into the input ports labelled l3 of processor 1 at t-2 and
processor 2 at t-1, invariance of $ih_1$ at its entry and consumption
times can be maintained.
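The role of the pumped zeros can be illustrated with a small Python sketch of the vertex function. The names `psi` and `pass_through` are hypothetical; the sketch only demonstrates the identity Ψ_1<a,b,0> = a used above, not the full timing of the array.

```python
def psi(a, b, c):
    """Inner-product step: the horizontal value accumulates b*c,
    while the vertical value b and oblique value c pass through."""
    return (a + b * c, b, c)


def pass_through(value, vertical_inputs):
    """Carry `value` across processors whose oblique input is 0.

    Each processor sees some vertical value b, but because c = 0
    the horizontal value leaves unchanged.
    """
    for b in vertical_inputs:
        value, _, _ = psi(value, b, 0)
    return value


print(psi(1, 2, 3))
print(pass_through(7, [3, -2]))
```

Whatever vertical values the intermediate processors happen to hold, a horizontal value that meets only zero oblique inputs arrives unchanged, which is the invariance the example relies on.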


Figure 5.5


The production times for $oh_5$ and $oh_6$ are t+9 and t+10
respectively. Table 5.2 gives the times at which $oh_5$ appears at
the input port labelled l1 of processor 4 and $oh_6$ appears at the
input ports labelled l1 of processors 3 and 4. The entries in
Table 5.2 are interpreted in the same way as the entries in Table
5.1. From Table 5.2 it is seen that by pumping 0 into the input
port labelled l3 of processor 3 at t+10 and processor 4 at t+9
and t+11, invariance of $oh_5$ and $oh_6$ at their production and exit
times can be maintained.


Lastly, as $\Psi_2$<a,b,c> = b for any a and any c, each vertical input
value $iv_j$ and output value $ov_j$ does not change as it travels
through processors in the array.


Table 5.1 


Table 5.2 
Example 5.2: Consider the convolution problem defined as
follows:


Given the sequence of weights {$w_1$, $w_2$, ..., $w_k$} and the input
sequence {$x_1$, $x_2$, ..., $x_n$}, compute the output sequence
{$y_1$, $y_2$, ..., $y_{n+1-k}$} defined by $y_i = \sum_{j=1}^{k} w_j x_{i+j-1}$.


We illustrate the convolution problem on n=5 and k=3. The
computation of the convolution problem for n=5 and k=3 is
represented by the following program graph:


In Figure 5.6, for all i and j with 1≤i,j≤3, $V_{ij}$ represents a
computation vertex. The horizontal, vertical and oblique edges are
labelled l1, l2 and l3 respectively.


Figure 5.6
Let Ψ denote the function represented by any computation vertex
in Figure 5.6. Ψ is a 3-ary function such that for any a, b and c,
Ψ<a,b,c> = <a+bc,b,c>. Let $\Psi_1$, $\Psi_2$, $\Psi_3$ be the three
projections of Ψ, i.e., $\Psi_1$<a,b,c> = a+bc, $\Psi_2$<a,b,c> = b and
$\Psi_3$<a,b,c> = c. If a, b and c are the input values represented by
the horizontal, vertical and oblique input edges of $V_{ij}$ then the
output values represented by the outgoing horizontal, vertical
and oblique edges of $V_{ij}$ are $\Psi_1$<a,b,c>, $\Psi_2$<a,b,c> and
$\Psi_3$<a,b,c> respectively. For all p with 1≤p≤5, all q with 1≤q≤3 and
all r with 1≤r≤3, let the input values represented by $is_p$, $iv_q$ and
$ih_r$ be $x_p$, $w_q$ and 0 respectively. It can then be verified that the
output value represented by $oh_r$ is $y_r = \sum_{q=1}^{3} w_q x_{r+q-1}$.
Let $E_{l1}$={horizontal major paths}, $E_{l2}$={vertical major paths}
and $E_{l3}$={oblique major paths}. It can be seen that
$L_1$={l1,l2,l3}, $L_2$=∅ and $L_3$=∅.
Let SG be the connected subgraph shown in Figure 5.7 that is
obtained by removing all the edges that are labelled l3. It can be
seen that the program graph in Figure 5.6 is in Θ.
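As a cross-check of the program graph, the following Python sketch walks each horizontal major path and applies the vertex function along it. It is an illustration under assumptions: vertex $V_{rq}$ is taken to receive weight $w_q$ on its vertical edge and $x_{r+q-1}$ on its oblique edge, and the function name `convolve_via_graph` is hypothetical.

```python
def convolve_via_graph(weights, xs):
    """Evaluate y_r = sum_q w_q * x_{r+q-1} one horizontal path at a time."""
    k, n = len(weights), len(xs)
    outputs = []
    for r in range(1, n - k + 2):      # one horizontal major path per output
        a = 0                          # horizontal source value ih_r = 0
        for q in range(1, k + 1):      # visit vertex V_rq
            b = weights[q - 1]         # vertical input w_q
            c = xs[r + q - 2]          # oblique input x_{r+q-1}
            a = a + b * c              # first projection of the vertex function
        outputs.append(a)              # value delivered to the sink oh_r
    return outputs


print(convolve_via_graph([1, 2, 3], [1, 0, 2, 1, 4]))
```

For n=5 and k=3 the sketch produces three outputs, one per horizontal major path, in agreement with the 3 x 3 arrangement of computation vertices.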


Now diagonalize SG with w=<1,0> to form the set D of main
diagonals. It can be verified that D={$D_1$,$D_2$,$D_3$} is comprised of
three diagonals where $D_1$={$V_{11}$,$V_{21}$,$V_{31}$}, $D_2$={$V_{12}$,$V_{22}$,$V_{32}$} and
$D_3$={$V_{13}$,$V_{23}$,$V_{33}$}.

Next diagonalize SG with $w_c$=<0,1> to form the set Dc of
complementary diagonals. It can be verified that
Dc={$Dc_1$,$Dc_2$,$Dc_3$} is also comprised of three diagonals where
$Dc_1$={$V_{11}$,$V_{12}$,$V_{13}$}, $Dc_2$={$V_{21}$,$V_{22}$,$V_{23}$} and
$Dc_3$={$V_{31}$,$V_{32}$,$V_{33}$}.


In Figure 5.7 all the computation vertices belonging to a single
diagonal in D lie on the same vertical major path. Similarly, all
the vertices belonging to a single diagonal in Dc lie on the same
horizontal major path.


Figure 5.7


Now $S_D$={$a_{l1}$,$a_{l2}$,$a_{l3}$}, $S_{Dc}$={$b_{l1}$,$b_{l2}$,$b_{l3}$} and $m_{l1}$=1,
$m_{l2}$=0, $m_{l3}$=-1, $c_{l1}$=0, $c_{l2}$=1 and $c_{l3}$=1. It can be verified
that Theorem 5.2 is satisfied.


We next synthesize the linear-array algorithm in [6]. |D|=3 and
hence the linear array has 3 processors indexed from 1 to 3.
$m_{l2}$=0 and there exist transitive edges labelled l2. Hence use
two-phase clocking. Each processor is comprised of 3 pairs of
input/output ports labelled l1, l2 and l3 respectively.


Let $si_t^1$, $si_t^2$ and $si_t^3$ denote the inputs at the input ports
labelled l1, l2 and l3 respectively of processor s at time t and let
$so_t^1$, $so_t^2$ and $so_t^3$ denote the outputs computed by s at time t.
Then $so_t^1 = si_t^1 + si_t^2 \times si_t^3$, $so_t^2 = si_t^2$ and
$so_t^3 = si_t^3$.

Using the construction in Theorem 5.2, we obtain $n_{l1}$=1,
$n_{l2}$=0 and $n_{l3}$=-1. We also obtain $d_{l1}$=1, $d_{l2}$=2 and $d_{l3}$=1. The
computation vertices in $D_1$, $D_2$ and $D_3$ are mapped onto
processors 1, 2 and 3 respectively. The resulting mapped graph is
shown in Figure 5.8.


Figure 5.8


Lastly, we must use some semantic properties of Ψ for correctness of
execution. $\Psi_2$ and $\Psi_3$ are such that for any a, b and c,
$\Psi_2$<a,b,c> = b and $\Psi_3$<a,b,c> = c. Hence, the input/output
value represented by the source/sink vertices of any vertical or
oblique major path does not change as it travels through
processors in the linear array. In Figure 5.8 it is seen that the
entry and consumption (production and exit) times for every
input (output) value represented by every horizontal source (sink)
vertex are the same.


Let $t_0$ be the time when the computation begins. Clearly $t_0$ < t.
Since $n_{l2}$=0 a register in each processor serves as the
input/output port labelled l2. Let $r_1$, $r_2$ and $r_3$ denote such a
register in processors 1, 2 and 3 respectively. Then the input
values of $iv_1$, $iv_2$ and $iv_3$, which are $w_1$, $w_2$ and $w_3$ respectively,
are preloaded into $r_1$, $r_2$ and $r_3$ respectively before $t_0$.


6. Conclusions 


We presented a formal model of linear arrays, and
introduced homogeneous graphs, which are a natural
representation of programs potentially executable on such arrays.
For a large class of homogeneous graphs, a set of necessary and
sufficient conditions on the structure of such graphs for the
existence of a syntactically correct mapping was established.
We then used our characterization to derive a systematic method
for synthesizing algorithms for a class of program graphs.
Extensions to the class of graphs examined in this paper can
be found in [8, 11].
ABSTRACT


A geometric representation of array computations is 
presented. Well known "systolic" designs for computing 
polynomial product are related to one another by affine 
transformations of a three-dimensional vector space. 
Much previous work on convolver designs is unified thus. 


Designs for linear transform and matrix product are
unified similarly. New designs are geometrically derived
for these computations that are asymptotically optimal
under the VLSI complexity measure area × period², an
appropriate measure for high throughput applications.


1. Introduction 


There has been considerable research recently into
array designs (see [Kung82] for a sampling of this
work). Leiserson and Saxe [LeSa81] provide a general
methodology for eliminating broadcasting from a synchronous
circuit without changing its communication
structure. Johnsson and Cohen [JoCo81] and Weiser and
Davis [WeDa81], [JWCDB81] investigate ways of formally
representing computational designs. Their respective
goals are similar: to be able to formally synthesize and
analyze computational designs taking into account
important design properties, such as correctness, area,
time, communication topology, and the presence or
absence of broadcasting/pipelining. Their strategies,
which are also similar, center around the explicit
representation of time in arithmetic expressions via a
delay operator.


In this paper we present a geometric representation of 
array designs. Our goals are similar in spirit to those of 
Johnsson, Cohen, Weiser, and Davis. We seek a unified 
framework in which to represent array designs so that 
different array designs for the same computation are 
related in a formal way. Where they use a delay opera- 
tor, we use an affine transformation. Since affine 
transformations are closed under composition (and in 
fact form a group), a single affine transformation can 
describe intuitively simple space/time rearrangements 
that may be difficult to describe as succinctly with 
other notations. A good representation, moreover, 
enhances a designer's intuition, and the geometric 
representation presented here may be helpful in this 
respect, too. It is easy to see, for example, how to
transform an array design that uses broadcasting into
one that does not.

Throughout this paper we use the terms bit/word and
serial/parallel, latency, cycle time, period, and
completely-pipelined as defined in [CaSt83]. The
model of computation used in this paper is the synchronous
model of VLSI [Thom80, BrKu81, BPP81]. We deal
with classes of functions and circuits that are
parameterized by a vector, π. For example, when we
consider the linear transformation of a vector of
length K, and wordsize B, the parameter vector is
π = (K, B). Asymptotic complexity will be measured
with respect to a parameter vector, π, throughout this
paper.


¹ Peter R. Cappello is now with the Department of Computer Science,
University of California, Santa Barbara, CA 93108.

This work was supported in part by the National Science Foundation

The remainder of the paper is organized as follows. In
section 2 we introduce and apply the geometric
representation to the problem of linear transform. In
sections 3 and 4, we apply it to convolution and matrix
product. We finish with a discussion of other applications
and some conclusions. The full version of this
paper is [CaSt83f].


2. Linear Transform 


In this section we introduce a geometric representation
of array designs by way of example. Consider the computation
of linear transform: y ← Ax. It can be written
as follows: $y_i \leftarrow \sum_j a_{ij} \cdot x_j$.

To make the example more concrete we write out these
expressions for a 3 × 3 matrix.


$y_1 \leftarrow a_{11} \cdot x_1 + a_{12} \cdot x_2 + a_{13} \cdot x_3$    (2.1)
$y_2 \leftarrow a_{21} \cdot x_1 + a_{22} \cdot x_2 + a_{23} \cdot x_3$
$y_3 \leftarrow a_{31} \cdot x_1 + a_{32} \cdot x_2 + a_{33} \cdot x_3$


We may think of the above set of expressions as constituting
an algorithm for transforming x into y. That is,
the expressions indicate that each output $y_i$ can be
obtained by computing certain specified products and
adding them together. Considerable freedom remains
as to how this algorithm can be implemented. For
example, the above notation does not dictate a particular
association for the additions; any will do. The following
recurrence relation fixes a particular association:


$y_{i,0} \leftarrow 0$    (2.2)
$y_{i,j} \leftarrow y_{i,j-1} + a_{ij} \cdot x_j$
$y_i \leftarrow y_{i,n}$


where n is the size of the vector x.
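The recurrence can be transcribed directly into Python. This is a minimal sketch, assuming the matrix is given as a list of rows and using the hypothetical function name `linear_transform`; the accumulator plays the role of the partial sum carried from one step of the recurrence to the next.

```python
def linear_transform(a, x):
    """Compute y = A x with the left-to-right association of the recurrence."""
    n = len(x)
    y = []
    for row in a:
        acc = 0                        # initial partial sum is 0
        for j in range(n):
            acc = acc + row[j] * x[j]  # add the next product a_ij * x_j
        y.append(acc)                  # final partial sum is the output y_i
    return y


A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
print(linear_transform(A, [1, 0, -1]))
```

Any other association of the additions would give the same result; the recurrence merely fixes one so that each partial sum can be assigned a location.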


Now, to make this algorithm more specific, we adjoin an
index representing time to y:

$y_{i,j,t} \leftarrow y_{i,j-1,t} + a_{ij} \cdot x_j$, for t = 0.    (2.3)

We take time to be discrete, and measure it in cycles†.
In our example, we set this time index, t, to zero. Thus
as presently formulated, this whole computation occurs
in one cycle: $cycle_0$. (The meaning of this time index is
further explained shortly.) The algorithm is not yet
completely specified; we have not indicated a particular
method for performing addition and multiplication. One
must have some primitive notions.
Definition: A computation is primitive if it is assumed 
that it can be done in constant area and with constant 
latency. We leave unspecified the algorithm for carry- 
ing out a primitive computation. 


In order to give this recurrence relation a geometric
interpretation we take the symbol "←" to represent the
location of its primitive computation.


† This notion of time does not exclude asynchronous array
computations, however.


Definition: The primitive computation represented by
← is what occurs on its right-hand side.


In this case the computation is an inner-product-step.


Definition: The location represented by ← is the index
values of the left-hand side interpreted as coordinates.


Figure 2.a illustrates the example. To properly interpret
the figures please note that the meaning of a
recurrence relation is unaffected by adding a constant
to all occurrences of an index (e.g., $y_{i,j} \leftarrow y_{i,j-1}$
represents the same computation as $y_{i-1,j-1} \leftarrow y_{i-1,j-2}$).
Equivalently, in the geometric representation a computation
is unaffected by translation. The reader is cautioned
that axes in the figures are intended merely to
associate dimensions with indices: for ease of viewing,
the geometric representations are translated to the
nonnegative orthant. Figure 2.a is interpreted as follows.
At location (1,1,0) the computation
$y_{1,1,0} \leftarrow y_{1,0,0} + a_{11} \cdot x_1$ occurs. This means that the
values of $x_1$ and $a_{11}$ must "be" at location (1,1,0) (i.e., at
spatial coordinates (1,1) at time coordinate 0). The solid
lines indicate the path of a particular value of x. We
refer to them as x-value contours. The dashed lines
represent summation paths. They denote a particular
addition association. These lines, or contours, are
intended to make interpretation of the figures more
intuitive. Movement of data, both input data distribution
and output accumulation paths, also can be determined
unambiguously from the recurrences themselves
by using an ordering rule. In this paper the ordering
rule is simply the lexicographic order of the spatial
indices within time. So for example, if an output value
is accumulated at more than one location with the same
time index value, then it is accumulated at these locations
according to the lexicographic order of their spatial
coordinates. We will call Eq. 2.3 the canonical
design of the algorithm denoted by Eq. 2.2, and denote
its geometrical representation by Γ. This representation
of the computation illustrates an important aspect
of time as we represent it. If two computations, $c_t$ and
$d_s$, are located at distinct points in time (i.e., t < s),
then c occurs before d. Again however, if t = s, then c
and d take place during the same cycle, but not necessarily
simultaneously. In fact, we have no notion of
"simultaneity" in our interpretation of time.


Interpreting Figure 2.a we see that the computation 
occurs in nine distinct locations in space, and at one 
location in time. 


Definition: When input information is at distinct loca- 
tions in space while at the same location in time, we say 
it is broadcast to those locations. (Again, no notion of 
simultaneity is implied.) 

Together, a geometric representation and an ordering 
rule indicate what data moves, where it moves to, and 
when it moves. The number of physically distinct com- 
putational elements is simply the number of computa- 
tional locations whose spatial coordinates are distinct. 
The topology of these elements emerges as well when we 
project out the time dimension. Thus the geometric 
representation associates a particular schedule of com- 
putation (an algorithm) with a particular network of 
computational elements. We speak of the association of 
an algorithm with a computational structure as a 
design. In what follows we apply various geometric 
transformations to the canonical design. The transfor- 
mations, then, relate distinct designs in a formal way. 


THE KUNG AND LEISERSON DESIGN 

The canonical design is not well suited to implementation
because the nine processing elements of the array
are idle most of the time. Leiserson and Kung [LeKu80]
present an array design for linear transform that is
better. Their design, denoted A, is illustrated in Figure
2.b. The communication structure is simply a linearly-
connected array of processors. Each processor performs
an inner-product-step computation. Briefly, the
design works as follows. X values move to the right
through the array, while output values are accumulated
as they move to the left. The transform coefficients
move down through the array as indicated by Figure
2.b. More detail is given in [LeKu80].


We now represent this design geometrically by obtaining
it from transformations on the canonical design. We
first apply R, a rotation-like transformation. We then
"interchange" a space dimension with the time dimension.
That is, we select a dimension that represents
space and interpret it as time. Since there can be only
one time dimension, we must now interpret the old one
as space. A permutation transformation, T, is used to
effect this semantic interchange¹. The result is
depicted in Figure 2.c. It is a geometric representation
of the Leiserson and Kung design. To see this, one
needs to interpret the representation. Since we interchanged
one of the two space dimensions with the time
dimension, the resulting design has only one
non-trivial spatial dimension. (Recall that the initial
time coordinate values of the computation were all 0.
Those coordinate values now represent a spatial dimension.
That spatial dimension is unused: it is trivial.) Thus
the communication structure is a linearly-connected
array. The x values move right in space over time, the
outputs move left in space over time, as in the Leiserson
and Kung design. The transform coefficients, x values,
and y values move through the linear array with the
same schedule as the Leiserson and Kung design. We
have derived the following formal representation of A:


$A = T \cdot R(\Gamma)$, where $R = \begin{pmatrix} 1 & 1 & 0 \\ 1 & -1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$ and $T = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}$.
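The effect of the two transformations on the canonical design can be checked numerically. The sketch below is an assumption-laden illustration: the first row of R is taken to be (1, 1, 0), which is a reconstruction from a damaged source, locations are taken to be (i, j, t) triples with t = 0, and the last coordinate after T is read as time. The helper names `apply` and `transformed_design` are hypothetical.

```python
def apply(matrix, point):
    """Multiply a 3 x 3 integer matrix by a 3-component point."""
    return tuple(sum(row[k] * point[k] for k in range(3)) for row in matrix)


# Assumed entries: R is rotation-like, T swaps the first and third axes.
R = [[1, 1, 0], [1, -1, 0], [0, 0, 1]]
T = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]


def transformed_design(n):
    """Image of the canonical locations (i, j, 0) under T after R."""
    canonical = [(i, j, 0) for i in range(1, n + 1) for j in range(1, n + 1)]
    return [apply(T, apply(R, point)) for point in canonical]


image = transformed_design(3)
processors = {point[:2] for point in image}  # distinct spatial coordinates
cycles = {point[2] for point in image}       # distinct time coordinates
print(len(processors), len(cycles))
```

Under these assumptions the nine canonical locations for n = 3 land on five distinct spatial positions along a line and five distinct time steps, which agrees with the 2n-1 processing elements and 2n-1 cycles stated for this design.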
Notice that A uses 2n-1 cycles and 2n-1 inner-product-step
processing elements. We now present a
new linear transform design that uses 2n-1 cycles but
uses only n inner-product-step processing elements. It
is illustrated in Figure 2.d. We first apply a transformation,
S, which skews the canonical representation, Γ.
The time-space interchange transformation, T, is then
applied as before. We interpret the result as follows.
Due to the time-space interchange the array again
extends in only one dimension in space. But now its
image in space is a linearly-connected array of n processing
elements, not 2n-1. X vectors, whose components
are skewed in time, are piped through the
array while the transformed vector's components are
accumulated in distinct processors, also over time.
There is no fill-up and drainage period with this design:
new x-vectors and transform coefficients can follow on
the "heels" of the preceding ones. The design, denoted
by Ψ, is related to the canonical design by the transformations
T and S:
1 1.0 


OO 
00 1 


This formal derivation of a new design illustrates the 
utility of the geometric transform approach. Both 
designs A and ¥ have period O(n), (ie., are not 
completely-pipelined). We next describe a new design 
that has period O(1) (i.e., is completely-pipelined). 


V=T ° S(T). where S= 


1 This is an example of what Johnsson and Cohen refer to as "mapping 
space into time.” In this geometric representation data control infor- 
mation is implicit. 


Furthermore, it is asymptotically optimal with respect 
to the complexity measure AP*. 


AN AREA × PERIOD² OPTIMAL DESIGN


Period rather than latency (delay) is a good measure in
applications where high throughput (rather than short
latency) is of interest. Before presenting our AP²
optimal design, a few terms and facts are noted. With
these we argue that linear transform is computationally
a more interesting problem when the transform is
fixed, so that the input size of the problem is n (words),
rather than n·n (words) where the transform is
multiplication by an n×n matrix.


Vuillemin has shown [Vuil80] that 1) linear transform,
convolution, and matrix product are transitive functions,
and 2) any circuit computing a transitive function
at data rate D must have wire area A_w ≥ a_w·D², for
some technology-dependent constant a_w.


Vuillemin's lower bound for linear transform is not valid
for every linear transform. The identity transform, for
example, clearly requires wire area only linear in the
data rate. The bound, however, is an existence bound:
it says there exist some linear transforms whose area is
Ω(D²). Many important transformations such as the
Discrete Fourier Transform (DFT), however, are among
those for which the quadratic bound holds [Vuil80]. We
note that since the period P = n/D, where n is the
number of input bits and D is the rate at which they are
read in, we have that for transitive functions (such as
the DFT) AP²(n) = Ω(n²).
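The step from the area bound to the AP² bound can be written out explicitly. The following short derivation is a sketch that uses only the two facts stated above, the wire-area bound A ≥ a_w·D² and the relation P = n/D:

```latex
\[
  A P^{2} \;\ge\; a_w D^{2} \left(\frac{n}{D}\right)^{2} \;=\; a_w\, n^{2}
  \quad\Longrightarrow\quad
  AP^{2}(n) = \Omega(n^{2}).
\]
```

The data rate D cancels, so the bound holds regardless of how the design trades area against period.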

We now explain how the AP² complexity of linear
transform can be dominated by I/O.


Definition: The aspect ratio, o, of a layout is W/L 
where W and L are its width and length, respectively. 


Most families of structures, such as complete binary
trees, have a set of parameters (e.g., the number of
leaves in the tree) that characterize any member of the
family. A layout for a family of structures can in general
have its length, width, and hence aspect ratio be a
function of those parameters. Layouts that do not have
constant aspect ratios are often considered undesirable
because they result in layouts that are long and thin as
n gets large. Lipton and Sedgewick [LiSe81] prove
(where T denotes latency) that AT²(n) = Ω(n²) for
constant aspect ratio layouts whose n inputs are
constrained to be at the boundary of the layout. We
generalize this result with the following.


Theorem: Let C be a circuit that computes f(n) with
period P. Let ℒ be a constant aspect ratio layout of C
with area A and perimeter p. The portion of ℒ's perimeter
used for input ports is denoted by p_i. ℒ, moreover,
may have O(1) convex input ports in its interior. These
interior ports may accommodate more than O(1) input
bits per unit time: the area of an interior port may be
more than O(1). Then AP²(n) = Ω(n²). (See [CaSt83f]
for proof.)


This strengthens Lipton and Sedgewick's result because
the bound is retained even if a constant number of
(large) interior ports exist, and because P ≤ T.


Thus when an n×n coefficient matrix is part of a
circuit's input, and the circuit has a constant aspect
ratio layout, then AP²(n²) = Ω(n⁴). However, when the
n×n coefficient matrix is fixed, there is a design that
has AP²(n) = O(n²), as we will now show. This design
achieves Vuillemin's lower bound and so is asymptotically
optimal. It also shows that the previous case is I/O
bound: boundary placement of I/O pads dominates the
layout area.


We now describe a design that achieves the AP² lower
bound for fixed linear transform. First the canonical
design is skewed as before. But now instead of
interchanging the (trivial) time dimension with a space
dimension, we pipeline this communication structure
(which is a two-dimensional mesh of processing elements).
A rotation-like transform, N, accomplishes this
(see Figure 2.e). We denote the resulting design by Δ,
where

$$\Delta = N \cdot S(\Gamma), \quad \text{where } N = \begin{pmatrix} 1 & 0 & 0 \\ \cdot & \cdot & \cdot \\ 1 & 0 & 0 \end{pmatrix}$$

As one can see from Figure 2.e, inputs are piped
through the mesh. Input vectors have their components
skewed in the time dimension. Output vector
components are similarly skewed. For this design,
P = O(1). Since the linear transform is fixed, we can
assume that the a_ij's are encoded into their proper
inner-product-step computation. (Again, the
inner-product-step is taken as primitive. That this assumption
is reasonable will become clearer when we give an
AP² optimal convolution design in section 3; the
inner-product-step can be viewed as a bit-level variant of
convolution.) The area is O(n²). Thus AP²(n) = O(n²),
which is asymptotically optimal. For designs Γ, Λ, and
Ψ, the coefficient matrix is part of the input: the input
is of size n². By the aspect ratio theorem, Γ, Λ, and Ψ
have AP²(n²) = Ω(n⁴). That is, these designs are
dominated by pin-in. When the input matrix is fixed, these
designs are not optimal with respect to the measure
AP².
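The claim that a mesh skewed in time accepts a new input vector every cycle can be illustrated with a small scheduling check. This is a sketch under stated assumptions, not the paper's construction: it assumes unit-area processing elements, ignores constants, and uses a hypothetical schedule in which vector v occupies element (i, j) during cycle v·period + i + j.

```python
def pipelined_mesh(n, num_vectors, period=1):
    """Schedule input vectors on an n x n mesh skewed in time.

    Hypothetical schedule: vector v occupies processing element (i, j)
    during cycle v * period + i + j.
    Returns (number of slot conflicts, total cycles used).
    """
    occupied = set()
    conflicts = 0
    last_cycle = 0
    for v in range(num_vectors):
        for i in range(n):
            for j in range(n):
                slot = (i, j, v * period + i + j)
                if slot in occupied:
                    conflicts += 1
                occupied.add(slot)
                last_cycle = max(last_cycle, slot[2])
    return conflicts, last_cycle + 1


def ap2(area, period):
    """Area times period squared, with all constants taken as 1."""
    return area * period ** 2


if __name__ == "__main__":
    n = 4
    print(pipelined_mesh(n, num_vectors=10))  # (0, 16): no element is used twice in a cycle
    print(ap2(n * n, 1))  # mesh: area n^2, period 1
    print(ap2(n, n))      # linear array: area n, period n
```

With period 1 no processing element is claimed twice in the same cycle, so the mesh has AP² proportional to n², whereas a linear array with area n and period n has AP² proportional to n³.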


3. Convolution
Consider the computation of convolution.

$$w_i \leftarrow \sum_j x_j \cdot y_{i-j} \qquad (3.1)$$

This one operation can represent an FIR filter, a
Discrete Fourier Transform [RdGd75], or (when "·"
is interpreted as a bit product and carry propagation is
included) multiplication. Also, Foster and Kung
[FoKu80] have noted that convolution describes string
pattern matching when "·" is interpreted as string
compare and "+" is interpreted as boolean and. For the
purposes of this paper Eq. 3.1 can represent any
computation where X, Y, and W are (not necessarily
distinct) sets, "·" is a map from X×Y to W, and (W,+) is a
monoid (i.e., an associative binary composition with
identity). In this section we apply our geometric
representation to this computation.
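The generality of Eq. 3.1 can be made concrete with a short sketch. The function below is an illustration, not code from the paper: `times`, `plus`, and `identity` are hypothetical parameter names standing for the map "·" and the monoid (W, +). It is used once with ordinary arithmetic and once with string compare and boolean and.

```python
def convolve(xs, ys, times, plus, identity):
    """Compute w_i = sum over j of x_j · y_{i-j} in the monoid (plus, identity)."""
    out = []
    for i in range(len(xs) + len(ys) - 1):
        acc = identity
        for j in range(len(xs)):
            if 0 <= i - j < len(ys):
                acc = plus(acc, times(xs[j], ys[i - j]))
        out.append(acc)
    return out


def match_starts(text, pattern):
    """String matching as convolution: compare for '·', boolean and for '+'."""
    m = len(pattern)
    # Reversing the pattern turns the convolution index i-j into a sliding window.
    flags = convolve(list(text), list(reversed(pattern)),
                     lambda a, b: a == b, lambda a, b: a and b, True)
    # Only windows that overlap the pattern completely are meaningful.
    return [i - m + 1 for i in range(m - 1, len(text)) if flags[i]]


if __name__ == "__main__":
    print(convolve([1, 2, 3], [4, 5, 6],
                   lambda a, b: a * b, lambda a, b: a + b, 0))  # [4, 13, 28, 27, 18]
    print(match_starts("abcab", "ab"))  # [0, 3]
```

The same loop structure serves both uses; only the interpretation of the two operators changes.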


In a recent article H. T. Kung [Kung82] enumerated
seven known designs for convolution. We relate six of
them to one another by geometric transforms. In this
way we unify much of the work on convolution designs.
Then we transform these to new AP² asymptotically
optimal designs. First we establish a geometric
representation of this computation. It is very similar to
linear transform. Writing out an example convolution
for x and y signals of length three, we obtain:

$$\begin{aligned}
w_0 &\leftarrow x_0 \cdot y_0 \\
w_1 &\leftarrow x_0 \cdot y_1 + x_1 \cdot y_0 \\
w_2 &\leftarrow x_0 \cdot y_2 + x_1 \cdot y_1 + x_2 \cdot y_0 \\
w_3 &\leftarrow x_1 \cdot y_2 + x_2 \cdot y_1 \\
w_4 &\leftarrow x_2 \cdot y_2
\end{aligned} \qquad (3.2)$$


We reformulate these using a recurrence relation:

$$\begin{aligned}
w_{i,0} &= 0 \\
w_{i,j} &\leftarrow w_{i,j-1} + x_j \cdot y_{i-j} \\
w_i &= w_{i,n}
\end{aligned} \qquad (3.3)$$

where n is the signal size. We let "←" represent an
inner-product-step computation located in space and
time as before. To do this we again adjoin a time
coordinate (see Figure 3.a):

$$w_{i,j,t} \leftarrow w_{i,j-1,t} + x_j \cdot y_{i-j}, \quad \text{for } t = 0 \qquad (3.4)$$


We take this to be the canonical design of the algorithm
expressed by Eq. 3.3, and denote its geometric
representation by Γ as before. We now proceed to derive some
known designs by geometrically transforming Γ to them.
Table 3.a summarizes the seven designs that H. T. Kung
noted in his article. Figure 3.b illustrates design B1 in a
conventional way. B1 is a design especially close to the
canonical design. By applying the time-space interchange
transformation T to Γ, B1 is derived (illustrated
in Figure 3.c): B1 = T(Γ). We interpret that figure as
follows. The same x value appears at processors that are
spatially but not temporally distinct: x values are
broadcast to their processors. Similarly, y values
(weights) appear at processors that are temporally but
not spatially distinct: y values "stay". And w values
(results) appear at processors that are distinct in both
space and time: they "move".


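Design  Inputs     Weights  Results  Notes
------  ---------  -------  -------  ---------------------------------------------------------------
B1      Broadcast  Stay     Move
B2      Broadcast  Move     Stay
F       Move       Stay     Fan-in
R1      Move       Move     Stay     Inputs and weights move in opposite directions
R2      Move       Move     Stay     Inputs and weights move in same direction at different speeds
W1      Move       Stay     Move     Inputs and results move in opposite directions
W2      Move       Stay     Move     Inputs and results move in same direction at different speeds


Table 3.a Convolution designs enumerated in [Kung82].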


We summarize the other derivations in Table 3.b. The 
transforms used are as follows (7, #, and N are as 
already defined). 


Table 3.b Geometric design definitions for the
word-serial convolvers and their word-parallel AP² optimal
counterparts

Figure 3.c illustrates the geometrical representations of
the six designs. The reader is invited to compare
interpretations of the geometric representations with
the verbal ones in Table 3.a. Design F is not related to
the others because "fan-in" is usually implemented with
an add tree and the addition association of such a tree
is different from that indicated by Eq. 3.3.


Nonetheless, this algorithm does admit other designs.
We now present some AP² asymptotically optimal
designs. Like their word-serial counterparts, these
designs have different data movement characteristics.
Such qualities are important in practice; a design (for,
say, integer multiplication) may have data movement
constraints deriving from the larger application in
which it is embedded. For each of the six designs we
have described geometrically, there is a corresponding
AP² optimal design. As in linear transform, we
transform these six convolution designs to AP²
optimal designs by pipelining them. In fact, the same
transform is used. We replace the T transform in Table
3.b with the N transform. Consider design R2, for
example. Its AP² optimal counterpart is illustrated in
Figure 3.d. Spatially it is a hex-connected mesh of
processors. X signal components, skewed in time, move
along their contours (the solid lines), y signal
components, skewed in time, move along their contours (the
dashed lines), while output signal components are
accumulated along the dotted lines. An input and output
signal can be accepted every cycle of the processor
assembly; i.e., the period is O(1), the area is O(n²), and
AP²(n) = O(n²). As in Section 2, Vuillemin [Vuil80]
shows that convolution is a transitive function:
A(n) = Ω(D²). Thus AP²(n) = Ω(n²), and these new
designs are asymptotically optimal. The word-serial
designs displayed in the left column of Table 3.b all have
P(n) = O(n) and A(n) = O(n): none are AP² optimal.

Finally, we note that all the designs presented have
AP(n) = O(n²). Put another way, designs implementing
the same algorithm all have the same switching energy,
E_sw (see [MeCo80] for a definition of this quantity); this
energy is just distributed differently in space and time.
Clearly, one-to-one transformations (such as affine
transformations) conserve E_sw.


4. Matrix product 


In this section we examine the matrix product computation.
After placing it in a geometric setting, we proceed
to derive and relate the band matrix product designs of
Leiserson and Kung [MeCo80] and Weiser and Davis
[WeDa81]. Finally we present an AP² asymptotically
optimal design for computing matrix product.


Given two matrices A and B, we can denote their
product C ← A·B. A more algorithmic description of
matrix product is

$$c_{ij} \leftarrow \sum_k a_{ik} \cdot b_{kj}, \quad \text{for } 1 \le i,j,k \le n \qquad (4.1)$$

where A, B, and C are n×n matrices. We have written
Eq. 4.1 for n = 2:

$$\begin{aligned}
c_{11} &\leftarrow a_{11} \cdot b_{11} + a_{12} \cdot b_{21} \\
c_{12} &\leftarrow a_{11} \cdot b_{12} + a_{12} \cdot b_{22} \\
c_{21} &\leftarrow a_{21} \cdot b_{11} + a_{22} \cdot b_{21} \\
c_{22} &\leftarrow a_{21} \cdot b_{12} + a_{22} \cdot b_{22}
\end{aligned} \qquad (4.2)$$

Again we reformulate using a recurrence relation:

$$\begin{aligned}
c_{i,j,0} &= 0 \\
c_{i,j,k} &\leftarrow c_{i,j,k-1} + a_{ik} \cdot b_{kj} \\
c_{ij} &= c_{i,j,n}
\end{aligned} \qquad (4.3)$$
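The recurrence can be executed directly. The following is a minimal sketch, assuming square matrices stored as nested lists with 0-based indices (the text uses 1-based indices); the function name is hypothetical. Each assignment in the innermost loop corresponds to one inner-product step, so the n³ steps form the three-dimensional index space (i, j, k).

```python
def matmul_recurrence(a, b):
    """Matrix product via c[i][j][k] = c[i][j][k-1] + a[i][k] * b[k][j]."""
    n = len(a)
    # c[i][j][k] is the partial sum after k inner-product steps; c[i][j][0] = 0.
    c = [[[0] * (n + 1) for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(1, n + 1):
                # Shift k by one to address the 0-based input matrices.
                c[i][j][k] = c[i][j][k - 1] + a[i][k - 1] * b[k - 1][j]
    # The result is the last partial sum along the k axis.
    return [[c[i][j][n] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    print(matmul_recurrence([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```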
In order to let "←" represent an inner-product-step
computation located in space and time we adjoin a
fourth coordinate, time (see Figure 4.a):

$$c_{i,j,k,t} \leftarrow c_{i,j,k-1,t} + a_{ik} \cdot b_{kj} \qquad (4.4)$$

We again call this the canonical design of the algorithm
expressed by Eq. 4.3, and denote its geometric
representation by Γ.


Band matrix product, an important special case of
matrix product, is illustrated conventionally by the
matrix expression in Figure 4.b whereas Figure 4.c
illustrates a geometric representation of the computation.

$$C = \begin{pmatrix} a_{11} & a_{12} & 0 \\ a_{21} & a_{22} & a_{23} \\ 0 & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} b_{11} & b_{12} & 0 \\ b_{21} & b_{22} & b_{23} \\ 0 & b_{32} & b_{33} \end{pmatrix}$$

Figure 4.b Band matrix product


Figure 4.d displays a summary (and conventional
representation) of the systolic design to do this computation
that was devised by Leiserson and Kung. The
reference [MeCo80] provides more detail. Their design,
which we denote by Λ, is based on the same algorithm as
the canonical design: Eq. 4.3. We now present a
geometric representation of the Leiserson and Kung
design. To obtain it one can take the staircase-like
structure of the canonical design and situate it vertically
using two rotation-like transforms. The Λ design,
illustrated in Figure 4.e, emerges when the vertical
space dimension is interchanged with the (previously
unshown) time dimension (i.e., interpret the vertical
dimension as time). The reader is invited to verify this
informally. The K matrix is a transformation that
combines the three transformations just described:


FE : -—1 1 
ms a Le ib 
1 1 0 Q 


In Λ, approximately one third of the processing elements
are active on any given cycle. Weiser and Davis
present [WeDa81] a design that improves Λ in this
respect. Their design, which we denote by Ψ, is depicted
conventionally in Figure 4.f. Like Λ, it uses a
hex-connected array. In Ψ, however, the A band matrix
flows through a row at a time, and the B band matrix, a
column at a time, producing the C product band matrix
a column at a time. (There is, of course, a dual design
producing a row at a time.) Ψ, obtained from Γ by an
affine transformation, is: Ψ = D(Γ) where D is defined
as follows:


a 
ua 1 
D= li 60 10 

4) ee 


In Ψ, all processing elements are working every cycle.
Its throughput rate is three times that of Λ.


As in convolution, other designs are possible. For example,
Preparata and Vuillemin [PrVu80] present a matrix
product design that is AP² asymptotically optimal; Γ, Λ,
and Ψ are not. It is a recursive design: matrix product
is computed by summing sub-matrix products. We now
present a new matrix product design that is also AP²
asymptotically optimal. The idea is to take the spatial
restriction of Γ, the three-dimensional mesh, and project
it onto two dimensions so that each processing element
has a unique image. (This transformation is not
affine and not one-to-one.) We skew this intermediate
design in order to obtain a pipelined design (i.e.,
eliminate input broadcasting and skew output accumulation
in time). The resulting design, denoted by Δ, is
illustrated geometrically in Figure 4.g.
f2n-1 110 

where T= : 6 : Z 
1 mio 


It has period O(1) because it is pipelined and has area
O(n⁴). Thus AP²(n²) = O(n⁴), which matches its lower
bound. As before, the lower bound derives from the fact
that it is a transitive function (see section 2 and
[Vuil80]).

It perhaps is worth noting that the projection of the 
three-dimensional mesh onto a two-dimensional space, 
resulting in some O(n) wire lengths, is necessary 
because VLSI is presently a two-dimensional medium. 
In three-dimensional VLSI, the three-dimensional mesh 
skewed in time produces a VP°(n*) = O(n°) design. 


KETEL) 3 


5. Other applications


This technique can be applied to a wide variety of
computations formulated as recurrence relations such as
the systolic designs given in [LeKu80] for LU decomposition
and the solving of triangular linear systems. These
designs involve arrays of more than one kind of processing
element, an important generalization. Geometric
representations can be applied in an even more general
setting, however. They are suited to represent any
n-dimensional cellular automata [Burk70] in an
(n+1)-dimensional space. A wealth of computation designs,
thus, can be explored using geometric transformations.


Indeed, viewed in this way it is easy to see that for every
computable function f, there is, in fact, a completely
pipelined (P_f = O(1)) hexagonally-connected cellular
automaton (i.e., a cell simple, systolic structure) that
computes f. (See [CaSt83f] for details.)


6. Conclusions 


We have presented a technique for placing array designs 
in a geometric framework and we have used this formal 
framework to relate different array designs. New 
designs were given for the functions discussed, and 
these too were related to other designs by transforma- 
tions. Design properties such as broadcasting and pipe- 
lining can be formally defined and their presence or 
absence in a particular design can be ascertained 
readily. A design's communication topology can be dis- 
closed likewise by projecting out the time dimension of 
the representation. 


Algorithms are traditionally designed for a random 
access machine (RAM). VLSI has precipitated a general- 
ization of the algorithm designer's task: a particular 
algorithm may be implemented via a wealth of different 
communication structures, each with different proper- 
ties. In the context of VLSI it has become necessary to 
distinguish between an algorithm and an implementing 
structure. We have used the noun design to designate a 
coupling of a particular algorithm with a particular 
communication structure. Recurrence relations such 
as those discussed in this paper lead to highly regular 
(array) designs and so are a convenient starting point 
for a more general investigation. On the other hand, a 
wide variety of functions can be formulated as such 
recurrence relations, so this approach has merit in its 
own right.

Figure 2.b Conventional representation of the Leiserson
and Kung linear transform design.

Figure 2.c Geometric representation of the Leiserson
and Kung linear transform design.

Figure 2.d Geometric representation of the design Ψ.

Figure 2.e Geometric representation of an AP² optimal
design Δ. Dotted lines trace a projection of computation
locations onto the i-time plane.

Figure 3.a Geometric representation of the canonical
convolution design. In all convolution figures, x
contours are represented by solid lines, y contours are
represented by dashed lines, and w accumulation paths are
represented by dotted lines.


Figure 3.b Conventional representation of design B1.


Figure 3.c Geometric representations of the designs
enumerated by Kung. Panels: designs B1, R2, B2, W2, R1,
and W1.


Figure 3.d Geometric representation of design R2's AP²
optimal counterpart OR2. Dots below are the projection
of the computation onto the i-time plane.


Figure 4.a Geometric representation
of the canonical matrix product
design. Panels: A matrix contours, B matrix contours,
and C matrix accumulation. Time, the fourth dimension, is
not shown in Figure 4.a.


Figure 4.c Geometric representation of the band
matrix product of Figure 4.b. To obtain a geometric
representation of the Leiserson and Kung design 1)
rotate the canonical representation such that a line
connecting computations labeled 1 and 5 would be
parallel to the k axis, and computations labeled 2, 3,
and 4 all would have the same k coordinate value; 2)
interchange the k and t dimensions (i.e., interpret the
k axis as time). See Figure 4.e.


Figure 4.d Conventional representation of the Leiserson
and Kung design for band matrix product.


Figure 4.e Geometric representation of the Leiserson
and Kung design for band matrix product.

Figure 4.f Conventional representation of the Weiser
and Davis design for band matrix product.


Figure 4.g Geometric representation of an AP² optimal
matrix product design. Dots below are the projection of
the computation onto the i-time plane.

Abstract

A primary reason for the susceptibility of
systolic algorithms to faults is their strong
dependence on the interconnection between the
processors in a systolic array. A technique to
transform any linear systolic algorithm into an
equivalent pipelined algorithm that executes on
arbitrary trees is presented.

1. Introduction

In this paper we present an approach to
obtaining special-purpose systolic devices that
are robust in the face of production flaws [1].
Due to the close coupling between communication
requirements and the processor interconnection
in systolic algorithms [2], any alteration in
the interconnection will cause these algorithms
to fail. A technique to transform the class of
linear systolic algorithms [3] into equivalent
pipelined algorithms that execute efficiently on
an arbitrary tree of processors is described.
Any connected set of fault-free processors may
then be used to execute these redesigned
systolic algorithms.


2. Computation Model

Let T be a rooted tree of (n+1) vertices
and L a finite set of labels. Construct a graph
G to serve as the tree machine as follows (Fig.
1). Consider some j ∈ L and replace each edge
in T by a pair of edges with label 'j' between
the vertices joined by that edge. Repeat the
construction for all labels belonging to
L. Consider the subgraph G^j consisting of all
the vertices in G and all edges labelled 'j'.
Let P^j be an Euler circuit in G^j with the root
as the initial and final vertex on the circuit.
Traversal of P^j induces a direction on each edge
in the path: an edge traversed from vertex u to
vertex w is directed into w. Partition the 2n
edges in P^j into two equally-sized sets of solid
and dotted edges as follows. For every vertex u
in G^j except the root, exactly one of the edges
of P^j that is directed into u is solid; the rest
of the edges are dotted. Index the vertex into
which the i-th solid edge in P^j (h_i^j) is directed
as v_i; index the root v_0. For every other label
k ∈ L, the edges in the corresponding Euler
circuit P^k are similarly partitioned so that if
the i-th edge in P^j is a solid (dotted) edge
between some two vertices in G, then the i-th
edge in P^k is also a solid (dotted) edge between
those vertices.
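The construction can be sketched in a few lines. The code below is an illustration under stated assumptions: the tree is given as a hypothetical `children` dictionary, the circuit follows a depth-first traversal, and the solid edge into each vertex is taken to be the one that first enters it (the text only requires that exactly one entering edge be solid).

```python
def euler_circuit(children, root=0):
    """Euler circuit of a rooted tree as (tail, head, kind) triples.

    The edge that first enters a vertex is 'solid'; edges back to the
    parent are 'dotted'.
    """
    circuit = []

    def visit(u):
        for w in children.get(u, []):
            circuit.append((u, w, "solid"))
            visit(w)
            circuit.append((w, u, "dotted"))

    visit(root)
    return circuit


def index_vertices(circuit, root=0):
    """Index the root as v_0 and the head of the i-th solid edge as v_i."""
    index = {root: 0}
    for _, head, kind in circuit:
        if kind == "solid":
            index[head] = len(index)
    return index


if __name__ == "__main__":
    tree = {0: [1, 4], 1: [2, 3]}  # hypothetical tree with n = 4 non-root vertices
    circuit = euler_circuit(tree)
    print(len(circuit))             # 8 edges, i.e. 2n
    print(index_vertices(circuit))  # {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}
```

With this choice of solid edges the vertex indices coincide with a depth-first (preorder) numbering of the tree.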


This research was supported in part by the
National Science Foundation under NSF Grant
Number MCS-8104017 and by ONR Contract
N00014-80-K-0987.

A token is a unit of data that passes between
vertices by traversal of the edges in G. Each
token has a label from the label set L and a
token with label 'j' (called a j-token)
traverses only edges with label 'j' in G. An
execution path for a token is an ordered
sequence of edges in G that the token traverses.
For notational convenience we shall often drop
the superscript on an edge that indicates its
label, with the understanding that a token of
the corresponding label would traverse that
edge. We consider the case when the execution
path for a token is the Euler circuit P and
distinguish two types of execution paths based
on the direction in which P is traversed. An
execution path P is said to be of type-1 if
(h_1, h_2, ..., h_n) is a subsequence of P;
correspondingly, P is of type-2 if
(h_n, ..., h_2, h_1) is a subsequence of P. A token that
follows a type-1 (type-2) execution path will be
referred to as a type-1 (type-2) token. If a
token is available at some vertex then it
traverses the next edge on its execution path
and becomes available at the vertex at the other
end of the edge traversed. A token is correctly
available at vertex v_i if it is available at v_i
and either (1) if it is of type-1 then the last
edge traversed is the solid edge h_i or (2) if it
is of type-2 then the next edge to be traversed
is the solid edge h_i.




A clock cycle is the basic time unit. Each
edge in G has a delay associated with it, which
is the number of cycles required by a token to
traverse that edge. The delay for all solid
edges of label 'j' is dh^j and for all dotted edges
of label 'j' is dc^j. If 't' is the cycle at
which a j-token is available at some vertex, it
is available at the vertex at the other end of
the edge at cycle t+dh^j if the edge is solid, or
at cycle t+dc^j if the edge is dotted. All tokens
begin their execution path at v_0 at the start
time of the token. Note that tokens move
continuously along the edges and are not delayed
at any vertex.


As illustration, consider the linear array
shown in Figure 2, where v_0 is the host for the
array and vertices v_1 to v_4 are processors,
connected by the Euler circuit P. A token is
available at any vertex v_i twice during its
traversal of P. A type-1 token will be correctly
available at v_i after traversing (h_1, ..., h_i).
Similarly, P' is a type-2 execution path, and a
type-2 token would be correctly available at v_i
after traversal of the edges
(c_1, ..., c_4, ..., h_{i+1}). P and P' correspond
naturally to the two ways of pipelining data
through a linear array.


At every cycle each vertex has a set of
tokens (one of each label) that are correctly
available at that vertex. A computation by a
vertex consists of a transformation of the
values of the tokens as they pass through the
vertex onto the next edge of their respective
execution paths. The set of tokens which are
correctly available at a vertex at any cycle are
said to meet at that vertex.


Our notion of correctness of a computation is
motivated by considering again the linear array
of Fig. 2. Let dh and dc be the delays along the
solid and dotted edges respectively. Consider a
type-1 token that starts from the host v_0 at
cycle 't'. It would be correctly available (and
hence used for computation) at vertex v_i at
cycle t+dh*i. Whenever the token is correctly
available at a vertex it meets other tokens
which are also correctly available at that
vertex. The simultaneous arrival of these
tokens at that vertex results in their values
being correctly updated. On an arbitrary tree,
the execution paths would be some other
permutation of solid and dotted edges. We wish
to ensure that all tokens which meet at some
vertex in the linear array also meet at the same
vertex in an arbitrary tree.


3. Conditions for Correct Execution

In this section we present the conditions
under which the correctness criterion discussed
above can be met. There are two cases to be
considered: (a) when all tokens are of type-1
and (b) when some tokens are of type-1 and the
others are of type-2. (These correspond to the
notion of one and two way pipelining as in [3].)
The condition for the first case is stated in
the following theorem.


Theorem 1: If two type-1 tokens with labels 'j'
and 'k' respectively meet at vertex v_i in a
linear array, then they meet at v_i in an
arbitrary tree iff dc^j = dc^k.

We illustrate the application of the
theorem by designing a robust version of
algorithm W2 in [3] for the convolution of two
vectors.


Convolution Problem: Given a sequence of weights
(w_1, w_2, ..., w_k) and an input sequence
(x_1, x_2, ..., x_n), compute the output sequence
(y_1, y_2, ..., y_{n-k+1}) where
y_i = Σ_{s=1..k} w_s * x_{i+s-1}.


Solution: Consider k=5 and let the tree machine
be G in Fig. 1. The weights w_1, ..., w_5 are
preloaded into vertices v_1, ..., v_5 respectively.
Tokens of the input sequence are labelled 'x'
and those for the output sequence 'y'. All
tokens are type-1 and the delays for the solid
edges are dh^x = 1 and dh^y = 2. The start time of
token x_i is 'i' and of token y_i is 'i-1'. The
algorithm above is just solution W2 put in the
notation of our model. To ensure that this
algorithm operates correctly on any arbitrary
tree, in particular that of Fig. 1, requires
dc^x = dc^y, and we choose the minimum possible delay,
equal to 1. It can be verified that required
pairs of input and output tokens meet at the
appropriate processors to compute the desired
values of y_i.
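The verification can be carried out mechanically. The sketch below is illustrative: Fig. 1 is not reproduced here, so the tree is a hypothetical one with five non-root vertices, the solid edge into each vertex is assumed to be the one that first enters it, and all function names are invented for the example. It runs W2 with dh^x = 1 and dh^y = 2 and lets the dotted-edge delays be varied.

```python
def euler_circuit(children, root=0):
    """Euler circuit of a rooted tree as (tail, head, kind) triples."""
    circuit = []

    def visit(u):
        for w in children.get(u, []):
            circuit.append((u, w, "solid"))   # first entry into w
            visit(w)
            circuit.append((w, u, "dotted"))  # return to the parent

    visit(root)
    return circuit


def type1_arrivals(circuit, start, dh, dc):
    """Cycles at which a type-1 token is correctly available at v_1, v_2, ..."""
    cycle, arrivals = start, []
    for _, _, kind in circuit:
        cycle += dh if kind == "solid" else dc
        if kind == "solid":
            arrivals.append(cycle)
    return arrivals  # arrivals[s - 1] is the cycle at v_s


def w2_on_tree(children, weights, xs, dc_x=1, dc_y=1):
    """Run W2 (dh_x = 1, dh_y = 2) on the tree and return the outputs."""
    circuit = euler_circuit(children)
    k = len(weights)
    # Token x_m starts at cycle m (1-based), token y_i at cycle i - 1.
    x_arr = [type1_arrivals(circuit, m, 1, dc_x) for m in range(1, len(xs) + 1)]
    outputs = []
    for i in range(1, len(xs) - k + 2):
        y_arr = type1_arrivals(circuit, i - 1, 2, dc_y)
        y = 0
        for s in range(k):
            # Accumulate w_s * x_m for every x token that meets y_i at v_s.
            y += sum(weights[s] * xs[m] for m in range(len(xs))
                     if x_arr[m][s] == y_arr[s])
        outputs.append(y)
    return outputs


if __name__ == "__main__":
    tree = {0: [1, 4], 1: [2, 3], 4: [5]}  # hypothetical stand-in for Fig. 1
    w, x = [1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7]
    print(w2_on_tree(tree, w, x))            # [55, 70, 85]
    print(w2_on_tree(tree, w, x, dc_y=2))    # tokens no longer meet correctly
```

With equal dotted-edge delays the outputs agree with the direct sum; with unequal delays the tokens miss each other on this tree, which is the behaviour Theorem 1 describes. On a linear array no dotted edge precedes a solid one, so the dotted-edge delays do not matter there.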


In a linear array of size k, if both ends of
the array are accessible, then the latency of
the pipeline is k*dh^y. By traversing a closed
path around the tree back to the host, the
latency is increased to k*(dh^y + dc^y). However,
this is a constant irrespective of the tree and
the throughput is exactly the same as that of a
linear systolic array.


The second case to consider is that of two-way
pipelining, i.e., one token is of type-1 and
the other is of type-2. In this case the
condition for correct computation on arbitrary
trees requires that the delays along the dotted
edges be zero, which is impractical. However,
we can simulate the effect of two way pipelining
as follows. Firstly, constrain the numbering of
vertices in the tree T so that it corresponds to
that obtained by some depth-first traversal [4]
of T starting from the root (Fig. 3). Secondly,
modify the type-1 execution path so that a
type-1 token now follows the shortest path
between v_0 and the vertex v_i, for all i=1,..,n.
This is accomplished by allowing multiple copies
of the token to be simultaneously present at
different vertices in the tree. When a token
becomes available at some vertex, it is used for
computation by that vertex, and copies of it are
transmitted to all adjacent vertices in the
tree, except the vertex from which the token was
received. This modified type-1 execution path
corresponds to the token being broadcast from
the root to every vertex in the tree by a series
of local broadcast steps. Fig. 3 shows a
modified type-1 execution path; note that all
edges are solid, and the delay along any edge in
the path is dh^j for a j-token. We now state the
conditions for correct execution of a two way
pipelined algorithm on an arbitrary tree.


Theorem 2: Let a type-1 token with label 'k' and
a type-2 token with label 'j' meet at some
vertex v_i in a linear array. Then a sufficient
condition for the two tokens to meet at v_i in an
arbitrary tree T is: (1) the vertices of T be
numbered by some depth-first traversal of T
starting from the root, (2) the type-1 execution
path be modified as above, and (3) dc^j = dh^k.


We illustrate this result with a robust
version of algorithm W1 in [3] that employs
two-way pipelining, for the convolution problem
defined previously.


Consider the tree machine of Fig. 4 where the
vertices are numbered by a depth-first ordering.
The weight w_i is preloaded into vertex v_{6-i}.
Output tokens are of type-2 and follow the
execution path P^y shown. Input tokens follow the
modified type-1 execution path, i.e., they are
locally broadcast from the root v_0 to all the
vertices of T. As in W1, the delays dh^y and dh^x
are both equal to 1. The start time of each of
the tokens x_i and y_i is 2(i-1). By Theorem 2,
delay dc^y must equal the delay dh^x: hence
dc^y = 1.


y, starts from vy at cycle 0, and is used for 
computation by vertices v, through v; at cycles 
3, 5, 6, 8 and 9 respectively. Consider x4 which 
starts from v, at cycle 2. It traverses the 
three edges between vg and vy in 3 cycles, and 
thereby meets y) at cycle 5 as required. 
Another copy of this token will simultaneously be 
present at v5 and will be updating Y2* which 
started at cycle 2, and is hence correctly 
available at Ve at cycle 5. 


This method of simulating a two way pipelined 
algorithm on an arbitrary tree may not be 
possible if the type-l token changes its value 
as it passes through the array. However, for a 
large class of algorithms (for example, see [5]) 
this method is directly applicable. There is 
no increase in either the latency or the 
throughput in executing the algorithm on an 
arbitrary tree rather than a linear array. 


4. Discussion 

The techniques discussed in this paper can 
be used to convert any linear systolic algorithm 
into an equivalent one that executes on an 
arbitrary tree of processors. From the viewpoint 
of wafer scale integration this is desirable 
since a tree can always be constructed on any 
connected set of fault-free processors that are 
obtained by discarding the faulty ones. The 
method can also be seen as a technique for 
converting algorithms designed for a particular 
interconnection structure into an equivalent 
algorithm on a different interconnection 
structure. 


ABSTRACT 


Conventional von Neumann architectures gen- 
erate addresses for referencing memory by transfer- 
ring addressing information from the memory to the 
CPU, by performing computations on addressing 
information, and by fetching and executing address 
bookkeeping instructions. Memory wait time is 
increased by computing operand addresses just 
before executing instructions which use those 
operands. These phenomena result in the von Neu- 
mann bottleneck at the CPU-memory interface. This 
work investigates one method of reducing the von 
Neumann bottleneck. 


Program referencing behavior is determined by 
analyzing dynamic address traces. The Structured 
Memory Access (SMA) architecture developed in this 
work uses a computation processor (CP) and a decou- 
pled memory access processor (MAP) with 
hardware for efficient accessing of program and 
data structures and for effective branch and loop 
management. 


Prototypical SMA machines are compared to con- 
ventional VAX-like machines. For a set of bench- 
mark programs, the SMA machine reduces the number 
of memory references to between 1/5 and 2/5 of 
those required by a VAX. Actual performance is 
very sensitive to memory bandwidth and the amount 
of unoverlapped computation; however, SMA machines 
perform significantly better than conventional 
machines with the same parameter values. The SMA 
architecture reduces addressing overhead and 
improves system performance by (1) efficiently gen- 
erating operand requests, (2) making fewer memory 
references, and (3) maximizing computation and 
access process overlap. 


1. INTRODUCTION 


This paper concerns the interactions between 
the central processing unit (CPU) and the memory of 
a computer system, modeled as shown in figure 1 
[Hamm77]. Work performed by the CPU is partitioned 
into an access process and a computation process. 
The access process generates a stream of addresses 
for read and write requests to be serviced by the 
memory. Write data can originate from either pro- 
cess. The memory responds to read requests by gen- 
erating a stream of data and instructions which 
returns to the CPU. Some portions of these data 
and instructions, returned to the access process, 
contain information for generating more references; 
the remaining portion is used by the computation 
process to produce its output data. In our view, 
the computation process performs the desired work 
of the system, while the work done by the access 
process represents overhead which should be 
reduced. 


Figure 1. CPU-memory model 


In conventional von Neumann architectures, the 
CPU interacts with only 1 memory, over 1 bus, and 
receives only 1 word per memory access. The compu- 
tation and access processes compete for access to 
the memory over this single, narrow path, the so- 
called "von Neumann bottleneck." 


A great deal of computing is performed solely 
for the generation of addresses. Hammerstrom 
[Hamm77] calculated the addressing overhead and the 
entropy of the stream of computation references by 
analyzing the address traces of several programs 
executed on an IBM 360. By measuring addressing 
overhead in bits input to the access process per 
computation process reference, he found that for a 
Gaussian elimination program and an eigenvalue- 
finding program, the addressing overhead was, 
respectively, 17.2 and 17.0 bits per computation 
reference. For a symbol manipulation program, the 
addressing overhead was 24.1 bits per computation 
process instruction fetch or memory data reference. 
These results represent a large percentage of the 
total number of bits input to the CPU from the 
memory. 


The inefficiency of the conventional access 
process is exposed further when the addressing 
overhead is compared to the entropy of the stream 
of computation references. The entropy of the computation 
reference stream is likewise measured in 
bits per computation reference and is interpreted 
as the average number of bits needed to select 
among the possible successor references, i.e., to 
choose the particular next reference address given 
the current reference address. If the current and 
the possible successor reference addresses are 
known, Hammerstrom found that for the programs he 
analyzed, between 0.845 and 1.86 bits of information 
per computation reference are needed to select 
among the possible successor reference addresses. 
These values can be treated as lower bounds on the 
number of bits which would be needed to specify a 
successor reference. Comparing these values to the 
addressing overhead, we find that they differ by at 
least an order of magnitude. Thus significantly 
more bits than necessary are being transferred 
between the memory and the access process during 
the execution of a program. 
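
To make the entropy measure concrete, the following Python sketch estimates the average number of bits needed to select the next reference given the current one, i.e. the conditional entropy H(next | current), from a short address trace. It is a first-order estimate under our own assumptions, not Hammerstrom's measurement procedure; the function name `successor_entropy` and the sample traces are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

def successor_entropy(trace):
    """Estimate H(next | current) in bits per reference from observed
    successor frequencies in an address trace."""
    successors = defaultdict(Counter)
    for cur, nxt in zip(trace, trace[1:]):
        successors[cur][nxt] += 1
    total = len(trace) - 1          # number of observed transitions
    bits = 0.0
    for counts in successors.values():
        n = sum(counts.values())    # times this current address was seen
        for c in counts.values():
            # p(cur, nxt) * -log2 p(nxt | cur)
            bits -= (c / total) * log2(c / n)
    return bits

# A perfectly regular loop needs no bits to predict the next address.
loop_bits = successor_entropy([1, 2, 3, 1, 2, 3, 1])
# One address (104) has two possible successors, so some bits are needed.
branchy_bits = successor_entropy([100, 104, 108, 100, 104, 112, 100, 104, 108])
```

In the second trace only the address 104 has more than one successor, and the estimate comes to roughly a third of a bit per reference, which illustrates how far below a full address width such a figure can lie.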


This access overhead and hence the von Neumann 
bottleneck can be reduced if the activities of the 
access process can be (1) modified to reduce the 
number of times the memory is accessed and (2) 
overlapped with those of the computation process. 
This paper introduces a Structured Memory Access 
(SMA) machine organization which includes mechanisms 
to take explicit advantage of a program's 
structure and of the regular patterns in which data 
structures are referenced. A more detailed presen- 
tation of the SMA architecture is found in 
[Ples82]. This architecture shares some goals and 
implementation characteristics with Smith's decou- 
pled access/execute (DAE) architecture [Smit82], 
notably the decoupling of access and execution 
onto separate processors. The SMA, however, has 
explicit mechanisms for reducing the bookkeeping 
overhead for data and program structure references. 
SMA executes a single instruction stream, whereas 
DAE requires two. 


The SMA "access mechanisms" eliminate most of 
the accessing overhead for well-structured computations. 
Previous attempts with conventional architectures, 
i.e., adding new address modes and 
including vector data types, have had less effect. Use 
of cache memory actually increases accessing overhead, 
both in time and in costly cache management 
hardware. Cache memory must be fast enough, and 
hence expensive, to overcome this overhead and provide 
improvement. By adding just one additional 
VLSI processor chip to run the access process, SMA 
achieves high performance with conventional slow 
memory. Effective access prediction tolerates slow 
memory if adequate interleaving is provided; eliminating 
overhead references can allow SMA to outperform 
a conventional processor with fast cache 
memory. 


The SMA also includes a "loop mode" which 
eliminates the need to refetch instructions for 
each iteration of short loops. Finally, by physi- 
cally separating the two processes into a memory 
access processor (MAP) and a computation processor 
(CP) and allowing them to be loosely coupled, 
memory wait time is reduced and significant process 
overlap is achieved. Compilation of SMA programs 
is straightforward and well suited to the capabili- 
ties of conventional compilers. 


Little use is made of program structure to 
reduce addressing overhead in conventional 
machines. Our preliminary studies [Ples81] have 
indicated that a substantial amount of structure 
can be ascertained directly from a mechanical 
analysis of a program's address stream. Knowing, 
or at least accurately predicting, possible successor 
memory references is very important in achieving 
an efficient access process and can significantly 
reduce the addressing overhead of a program. 
Additionally, exploiting this predictability leads 
to a more nearly autonomous operation of the access 
process and the computation process, thus permitting 
an overlapped execution of the two processes 
and a reduction of memory wait time. 


2. STRUCTURED MEMORY ACCESS MACHINE (SMA) 
ARCHITECTURE 


From the analysis of program address traces, 
it is possible to determine the control and data 
structures of a program and the mechanisms by which 
data structures are accessed. Instructions can be 
divided into blocks such that blocks are entered 
only at the first instruction and execution always 
proceeds sequentially to the last instruction. A 
block may have one, two, or more successor blocks. 
The control structure of a program is well identi- 
fied by a graph in which nodes are blocks and arcs 
point to successor blocks. This control structure 
can be found by mechanically analyzing the dynamic 
address trace of a program. Furthermore, if an 
instruction in a static listing always references 
the same memory location for its operand, e.g., 
with a direct addressing mode, that reference is 
called a scalar data reference; if more than one 
location is referenced by that operand reference as 
it recurs in a dynamic address trace, e.g., by 
using an index mode where the index value varies, 
the reference is called a data structure reference. 
This behavioral definition of scalar and structure 
tends to correspond in practice to common defini- 
tions. A sequential pattern of accessing through a 
data structure, as required for successive execu- 
tions of a data structure reference, e.g., a row 
major scan of the upper triangle of a matrix, is 
called an access mechanism. Scalar data, data 
structures, and access mechanisms can likewise be 
found mechanically from trace analysis. 
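
The behavioral definition of scalar and data structure references lends itself to a mechanical check. The Python sketch below is one possible illustration, assuming the trace is available as pairs of instruction address and operand address; the function name `classify_references` and the sample trace are hypothetical and are not taken from the trace analysis tools used in this work.

```python
from collections import defaultdict

def classify_references(trace):
    """Classify each operand reference by the set of locations it touches.

    trace: iterable of (instruction_address, operand_address) pairs taken
    from a dynamic address trace.
    """
    seen = defaultdict(set)
    for pc, operand in trace:
        seen[pc].add(operand)
    # One distinct location -> scalar; several -> data structure reference.
    return {pc: "scalar" if len(addrs) == 1 else "data structure"
            for pc, addrs in seen.items()}

# Hypothetical trace: the instruction at 0x10 always reads 0x500, while the
# instruction at 0x14 steps through consecutive words of an array.
trace = [(0x10, 0x500), (0x14, 0x800),
         (0x10, 0x500), (0x14, 0x804),
         (0x10, 0x500), (0x14, 0x808)]
kinds = classify_references(trace)
```

Recovering the access mechanism itself would additionally require examining the order in which the locations of a data structure reference are visited, which this sketch does not attempt.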


The Structured Memory Access (SMA) architec- 
ture uses this structural information to reduce 
access process overhead. Access process overhead 
exists in two forms. Address specification over- 
head refers to the increasing number of address 
bits needed to address a memory location as the 
address space becomes large. Most of these bits 
are redundant, given knowledge about possible 
address sequences. The second and more costly form 
of overhead, address calculation overhead, refers 
to address calculations explicitly performed by the 
CPU. Address calculation overhead involves some 
combination of extra instructions, parts of 
instructions, registers, memory accesses, and computation 
time. Both types of overhead are greatly 
reduced by the SMA architecture. Consider, for 
example, branch target addresses and operand refer- 
ences. 


In the SMA machine, the complete branch target 
address is specified in a branch instruction. How- 
ever, since the SMA machine provides instruction 
buffers to capture repeatedly executed instruction 
blocks, the number of times the branch instruction 
and its target address are accessed is reduced. 
The instruction buffer effectively limits the 
number of bits fetched from memory to specify 
branch target addresses. 


To reference scalars, the SMA machine provides 
a base register. A scalar reference specification 
is simply an offset to be summed with the contents 
of the base register to form an entire scalar 
address. Traces we have analyzed reference few 
distinct scalars for the computation process, and a 
few bits are sufficient for the offset. 


The referencing of data structures is the 
prime cause of address calculation overhead. Conventional 
machines use bookkeeping scalars and 
instructions to manage iterative sequencing through 
data structures. To reduce this overhead, the SMA 
architecture uses special hardware to generate data 
structure references with minimal input to the 
access process from the memory. 


The SMA machine implements the function of 
index registers by using a hardware stack. This 
stack tracks the active indices of inner loops during 
program execution, and all data structure 
references are made by using a subset of these 
index values. To reduce access process input, 
tables in the SMA processor are used to store the 
base address of each data structure and other 
information necessary to generate an entire address 
from indices. These tables must be loaded before 
any instruction which uses them is executed. 
Depending on the amount of hardware allocated for 
the tables, the number of data structures, and 
their access mechanisms, the tables may only have 
to be loaded once at the beginning of program exe- 
cution. The tables may also be loaded during the 
execution of the program. A data structure refer- 
ence specification is thus a set of pointers to 
table entries. Such a scheme provides the neces- 
sary flexibility for generating access mechanisms 
while maximizing the rate of address generation 
through the use of pipelining techniques. 


Generally, the value of an index only needs to 
be associated with the access process. Thus, the 
stack containing indices, the tables for generating 
data structure references, and the address genera- 
tion portion of the CPU may be separated from the 
computation-oriented portions of the CPU. This 
partition divides the computer system into two processors: 
a computation processor (CP) and a memory 
access processor (MAP). The CP is used strictly 
for the computation process, i.e., the useful com- 
putations of the system; while the MAP is responsi- 
ble for the access process, i.e., generating all 
addresses for data and instructions. The index 
stack and the associated access tables mentioned 
above are kept in the MAP. Since only the MAP gen- 
erates addresses, it controls all transactions with 
the memory. 


There is no address bus between the CP and the 
memory since all memory requests are generated and 
controlled by the MAP. Also, since the CP is not 
responsible for addressing, the instructions sent 
to the CP contain no addressing information. Thus, 
the instructions are short and contain little more 
than opcodes and register tags. The CP is strictly 
devoted to performing computations and contains the 
ALU of the system; instructions and data are 
streamed into the CP by the MAP. The CP may 
receive entire blocks of instructions which it then 
holds in an internal instruction buffer. The CP 
may execute in loop mode by iterating over one or 
more blocks of instructions in its buffer without 
refetching. The CP also has a set of registers for 
holding the scalars used by an instruction block. 
The internal instruction buffer and the registers 
are provided to eliminate some repeated memory 
accessing and its associated time and load on the 
MAP. 


The MAP, as shown in figure 2, has an internal 
Operand-Instruction Buffer (OIB) to hold its 
instructions and the operand specifications of CP 
instructions. The MAP can also operate in a loop 
mode fashion. Operation of the MAP is, to a great 
extent, independent of the CP. When the MAP begins 
receiving instructions, it forwards the MAP 
instructions and the operand specification portions 
of CP instructions to its OIB. The opcode and 
register tag portions of CP instructions are forwarded 
to the CP instruction buffer. The MAP 
immediately begins generation of operand addresses. 
The operand addresses are then placed on the read 
or write queue of outstanding memory requests. 
Write data is produced in the same order as 
corresponding write addresses. Thus when write 
data is produced, it is paired with the appropriate 
queued write address. As soon as a read request is 
serviced, the operand returned by that request is 
forwarded to the CP or to the MAP tables. With 
such a scheme, reads are performed early, writes 
are done late, the CP concentrates on the useful 
calculations of a program, and the MAP is left with 
the important, but overhead-related, generation of 
addresses. 
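
The pairing of late-arriving write data with queued write addresses can be illustrated with a small Python model. This is a behavioral sketch only, assuming a simple first-in first-out discipline; the class `WriteQueue` and its method names are hypothetical and say nothing about how the MAP hardware is actually organized.

```python
from collections import deque

class WriteQueue:
    """Write addresses are queued early; data arrives later in the same order."""

    def __init__(self):
        self.pending = deque()   # addresses waiting for data, oldest first

    def push_address(self, address):
        """Called when the access process generates a write address."""
        self.pending.append(address)

    def push_data(self, value):
        """Pair data from the computation process with the oldest unpaired
        address and return the completed (address, value) memory write."""
        return (self.pending.popleft(), value)

queue = WriteQueue()
queue.push_address(0x200)        # addresses are generated ahead of the data
queue.push_address(0x204)
first = queue.push_data(7)       # first result pairs with the oldest address
second = queue.push_data(9)
```

Because both streams are produced in program order, no tag matching is needed in this model: the head of the address queue always belongs to the next datum.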


The MAP accesses instruction blocks, scalars, 
and data structures with special hardware. Special 
instructions initialize and control the access 
mechanisms used for address generation. An SMA 
program thus contains a mixture of MAP and CP 
instructions. The data type of each operand is 
explicitly specified in each instruction. This 
extra information found in SMA instructions 
requires that the compiler be capable of distinguishing 
loop control (index) branching from 
data dependent branching, and scalars from data 
structures. 


Super computers and vector machines also con- 
tain special hardware for array referencing; how- 
ever, the programming of these machines quite often 
requires rearranging of an algorithm to suit the 
hardware, and program compilation is difficult. 
Furthermore, their structured data access mechan- 
isms are usually limited to a single vector of the 
structure at a time, i.e., "a constant stride" or 
constant step-size access mechanism with one index. 
Also, the same operation must be executed on each 
element of the vector, few vectors can be active at 
once, and access mechanisms are not easily 
suspended and resumed when more complex program 
loops are executed. The TI-ASC offers somewhat 
more flexibility by providing both inner and outer 
loop control for stepping through a matrix, i.e., 
two active indices. 


The SMA machine provides more flexibility in 
the accessing of matrices since it offers more 
index levels by providing a stack on which to store 
indices. In our observations, even a 2-dimensional 
structure can require 3 levels of loop nesting for 
controlled rescan. Extra levels of nesting are 
also useful for providing nonconstant strides. 


In vector machines, the vector access mechanisms 
are explicitly coded into instructions and 
then recognized and set up during execution time. 
The SMA architecture is designed so that data 
structure access mechanisms are recognized as early 
as possible. Some access mechanisms can be set 
up as early as compile time or load time. This 
early recognition can lead to reduced run-time 
overhead. 


In conventional systems, the ALU makes branch 
decisions. In the SMA, two types of branch deci- 
sions are distinguished: decisions based on pro- 
gram data, which are made by the CP, and those 
based on indices used for referencing data, which 
are made by the MAP. 


The SMA thus reduces the serial dependence 
which exists between the access process and the 
computation process. Since the MAP makes branch 
decisions based on index values during the execu- 
tion of a loop, the MAP can generate memory 
requests for operands before the CP is ready to 
execute the instructions requiring those operands. 
In fact, the MAP should normally stay ahead of the 
CP so as to minimize the amount of time that the CP 
waits for data from memory. The MAP must wait for 
the CP when the MAP's read data queue is full or 
when the CP must resolve a computation-dependent 
branch. The CP must wait for the MAP when the 
MAP's write data queue is full, when the read data 
queue is empty, or when the CP instruction buffer 
does not contain the next instruction. 


The SMA organization described above is used 
to reduce the addressing overhead primarily by 
improving the accessing of data structures through 
efficient access mechanisms and prefetching. The 
process of accessing instructions can likewise be 
improved if information concerning the instruction 
block structure of a program, which is apparent in 
high-level source code, is kept with the program as 
it is translated down to machine level. Retaining 
the block structure of a program can be used advan- 
tageously to cause the CP to enter and leave loops. 


In conventional machines, loop mode control is 
generated dynamically during execution. Upon 
recognizing a short backwards branch, it is assumed 
that the second iteration of a loop is about to 
begin. The instructions of the loop are refetched 
and trapped in the loop buffer where they remain 
for repeated execution until the loop ending branch 
is unsuccessful. 


The loop buffers in the CP and the MAP also 
trap loop instructions. However, loop mode control 
is set up at compile time. Loop structures are 
quite explicit and obvious in the high-level 
language source code available at compile time. If 
the instruction blocks which form the body of a 
loop are sufficiently short, they may all be stored 
in the instruction buffer at the same time. The 
processors thus are able to trap the body of a loop 
the first time the loop is executed. The loop 
buffers eliminate the need for repeated memory 
accesses for the same instructions during the exe- 
cution of a loop. In any case, repetition requires 
no data-dependent branch and no wait time. Execu- 
tion continuation after the loop is also efficient 
when the successor block is known, since it can be 
prefetched by the MAP during loop execution. 


This structured prefetching by the MAP from a 
conventional slow memory with small processor 
buffers and without overhead references is felt to 
be superior to the unstructured use of a costly 
cache memory with attendant miss penalties, superfluity 
problems, and access overhead reference 
cycles. 


3. AN SMA IMPLEMENTATION 

3.1. Data Types 

In this implementation, the SMA distinguishes 
among four types of operands: (1) immediate 
operands, (2) scalar operands, (3) data structure 
operands, and (4) index operands. Immediate 
operands are data whose values are embedded in an 
instruction. Scalar and data structure operands 
are defined as above. An index operand is one of 
the current indices found on the index stack. The 
index operand is used only to read a current index 
value from the index stack and transfer its value 
to the CP. An index operand differs from a scalar 
operand primarily in that the index operand ori- 
ginates from the MAP while the scalar operand ori- 
ginates from the memory. The operand type may be 
specified in a subfield of an instruction's operand 
field or it may be implicitly associated with a 
particular instruction. Additionally, an indirect 
addressing mode is provided specifically for use in 
the calling of subroutines and in the accessing of 
data items from structures such as linked lists. 
As with operand type specification, indirect 
addressing may be specified in a variety of ways. 


At some time it may be desirable to distin- 
guish explicitly among several types of data struc- 
tures. Instead of having a data structure operand, 
one may wish to have an array operand, a linked 
list operand, a binary tree operand, etc. For each 
operand type, some special accessing mechanisms 
would be provided to improve the speed with which 
an operand address is generated. Accessing mechanisms 
as implemented allow the instruction code to 
reference structured data simply by pointing to the 
mechanism, which then references the next data in 
the established pattern. 


3.2. Scalar Data 


Scalar data is treated in the manner of a vector 
rather than as a set of disassociated items. 
The specification for a scalar operand includes a 
specification of a MAP base register and a dis- 
placement into a scalar data area in memory. The 
MAP can have more than one scalar base register to 
aid in the accessing of local and global variables, 
such as during subroutine calls. Such a base 
register can be used as the argument pointer set 
during a subroutine call. For the programs we stu- 
died, the number of simultaneously active scalars 
is relatively small, particularly in an SMA pro- 
gram. 


3.3. Index Operations 


The SMA's memory access processor has special 
mechanisms to track indices for data structure computations. 
These indices are used to generate the 
addresses for specific items of the data structure 
to be referenced. An index is specified by its 
current value, final value, step-size and indexing 
level. When the index is first established, the 
current value is equal to its initial value. At 
any time, several indices may be active; and the 
level, or nesting, of these indices is dictated by 
the time at which they were instantiated, or set 
up. In the SMA, the current value, final value, 
and step-size of an index are kept on a LIFO stack 
structure known as the index stack (IS). Each 
stack position is numbered sequentially, with the 
bottom of the stack numbered level 1. This convention 
provides a convenient way of referring to the 
current value of any active index because the bot- 
tom of stack entry corresponds to the outermost 
level of nesting, i.e., level 1. Stack continua- 
tion in memory can be provided for overflow. 


When a "setup index" instruction is executed 
by the MAP, the initial value, final value, and 
step-size are pushed onto the stack. To change the 
current value of an index, an "increment index" 
instruction is used. This instruction must specify 
three items: the level of the index which is being 
incremented, and two initial addresses of blocks 
which are the targets of a branch outcome. If the 
current value of the index is less than the final 
value, control is transferred to the first block 
which is specified. If, on the other hand, the 
current value equals or exceeds the final value, 
control is transferred to the second specified 
block. 


By checking the index value early, during each 
increment index instruction, and by having the 
branching information available, the next instruc- 
tion can start being accessed while the CP is still 
performing the final computations of a loop. 
Furthermore, no guess is made about which direction 
an index-based branch will take, thus no time is 
wasted in fetching potentially unnecessary blocks 
of instructions from the main memory. 


When the current value of the index has 
reached its final value, that index should be at 
the top-of-stack and it is removed (popped) from 
the stack. Two other methods of removing indices 
from the stack are (1) the "remove index" instruc- 
tion which removes the highest level current index 
from the top of the stack and (2) a "clear all 
indices" instruction which removes all indices from 
the stack. 
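
The index instructions described above can be modeled in software as follows. This Python sketch is an interpretation rather than a specification: the class `IndexStack`, its method names, and the block labels are hypothetical, a positive step-size is assumed, and we assume that the test against the final value is made before the step is added, which the text leaves open.

```python
class IndexStack:
    """LIFO stack of [current, final, step] entries; level 1 is the bottom."""

    def __init__(self):
        self.entries = []

    def setup(self, initial, final, step):
        """'setup index': push a new innermost index."""
        self.entries.append([initial, final, step])

    def current(self, level):
        return self.entries[level - 1][0]

    def increment(self, level, continue_block, exit_block):
        """'increment index': return the block to execute next.
        Assumption: compare first, then add the step."""
        entry = self.entries[level - 1]
        if entry[0] < entry[1]:
            entry[0] += entry[2]
            return continue_block
        # Index exhausted: per the text it should be on top of the stack.
        assert level == len(self.entries), "exhausted index is not on top"
        self.entries.pop()
        return exit_block

    def remove(self):
        """'remove index': pop the highest-level index."""
        self.entries.pop()

    def clear(self):
        """'clear all indices': empty the stack."""
        self.entries.clear()

# Two nested loops: i = 1..2 at level 1, j = 1..3 at level 2.
stack = IndexStack()
visited = []
stack.setup(1, 2, 1)
while True:
    stack.setup(1, 3, 1)
    while True:
        visited.append((stack.current(1), stack.current(2)))
        if stack.increment(2, "inner", "after_inner") == "after_inner":
            break
    if stack.increment(1, "outer", "done") == "done":
        break
```

Under these assumptions the inner index is pushed once per outer iteration and popped automatically when it reaches its final value, so the stack is empty again when the outer loop finishes.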


To save loading of index instruction parameters 
from memory, the MAP is loaded with a set of 
templates for these values at the start of program 
execution. A template is a specification of the 
values needed to initialize an index on the index 
stack. Templates are loaded into an index template 
table. When an index is set up in the IS, the IS 
is loaded directly from the index template table. 
For a particular program, the number of distinct 
templates could be fairly small. For example, 
analysis of a Gaussian elimination program shows 
that 995 dynamic index setups are required, 
representing 16 static index setups, but only 8 
templates are needed. Each index activated with a 
particular initial specification can use the same 
entry from the index template table. Even if the 
number of templates exceeds the table size, judi- 
cious reloading limits overhead. 
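
The sharing of templates among index setups amounts to removing duplicate specifications. The Python sketch below shows one way a loader might build such a table, assuming each setup is described by an (initial, final, step) triple; the function `build_template_table` and the sample setups are hypothetical and are not the Gaussian elimination data quoted above.

```python
def build_template_table(setups):
    """Collapse index setups with identical specifications into one template.

    setups: list of (initial, final, step) triples, one per static setup.
    Returns the template table and, for each setup, its table entry number.
    """
    table, entry_of, ids = [], {}, []
    for spec in setups:
        if spec not in entry_of:
            entry_of[spec] = len(table)   # first use: allocate a new entry
            table.append(spec)
        ids.append(entry_of[spec])
    return table, ids

# Four static setups that need only two distinct templates.
setups = [(1, 10, 1), (1, 10, 1), (2, 10, 1), (1, 10, 1)]
table, ids = build_template_table(setups)
```

Each setup instruction then needs to carry only a small entry number instead of the three values themselves.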


3.4. Data Structure Accessing 


To access data structures in the SMA, one must 
combine index values to form a data address. In 
the SMA, information for forming proper combinations 
is stored in two data tables within the MAP. 
As with some of the other repeatedly used informa- 
tion, the contents of the tables may be loaded when 
the program begins execution. The two tables are 
the access pattern table (APT), which indicates the 
index levels to be used, and the access information 
table (AIT), which contains information about data 
structures. 


Each line of the APT is divided into several 
dimension fields. Each dimension field is divided 
into 2 subfields. The index level subfield indicates 
which level of the index stack (IS) is associated 
with that dimension field. The offset subfield 
contains the value of a small positive or 
negative offset to be added to the index before the 
index is used. This feature is useful since quite 
often the index of a data structure access is an 
existing presently active index, plus or minus a 
small constant. An entry in the APT may be used by 
more than one data structure since the information 
is not altered during execution and does not depend 
on accessing a specific data structure. 


For each data structure currently being used 
by the program, there is an entry in the AIT. If 
the number of data structures in a program is sufficiently 
small, the AIT need only be loaded at the 
beginning of program execution. Each entry of the 
AIT is composed of three types of values: (1) the 
base address of the data structure, (2) a displacement 
for each dimension of the data structure, and 
(3) an upper bound for each dimension of the data 
structure. 


A data structure reference may be made by 
specifying an entry in the APT and an entry in the 
AIT. A data structure address is generated by sum- 
ming the base address in the AIT entry and the 
index terms for each dimension. Each term is 
formed by adding the offset in the APT entry to the 
index value identified by the level in the APT 
entry and multiplying by the displacement in the 
AIT entry. Bounds checking can be performed by 
comparing the index terms with the bound in the AIT 
entry. 
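
The address computation just described can be written out as a short Python function. This is a sketch under stated assumptions: the function `data_structure_address`, the dictionary layout of the AIT entry, and the example matrix are hypothetical, and we assume that bounds checking compares the offset-adjusted index (before scaling) against the upper bound, with a lower bound of zero.

```python
def data_structure_address(index_stack, apt_entry, ait_entry):
    """Form a data address from current index values.

    index_stack: current index values, level 1 first.
    apt_entry:   list of (level, offset) pairs, one per dimension.
    ait_entry:   dict with 'base', 'displacements' and 'bounds'.
    """
    address = ait_entry["base"]
    dims = zip(apt_entry, ait_entry["displacements"], ait_entry["bounds"])
    for (level, offset), displacement, bound in dims:
        term = index_stack[level - 1] + offset      # index plus small offset
        if not 0 <= term <= bound:                  # assumed bounds check
            raise IndexError(f"index term {term} outside 0..{bound}")
        address += term * displacement              # scale by displacement
    return address

# Hypothetical 4 x 5 matrix of 4-byte words stored row-major at address 1000.
ait = {"base": 1000, "displacements": (20, 4), "bounds": (3, 4)}
apt = [(1, 0), (2, -1)]          # row from level 1, column from level 2 minus 1
address = data_structure_address([2, 3], apt, ait)   # element [2][3-1]
```

With indices 2 and 3 on the stack this reference resolves to 1000 + 2*20 + 2*4 = 1048; the same APT entry could be reused with a different AIT entry to scan another matrix in the same pattern.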


While this computation may be tedious to perform 
for each data structure access, hardware must 
be present in the MAP to perform this computation 
at least occasionally. Once the hardware is 
present, it can be pipelined at little additional 
cost to allow the straightforward solution of performing 
this computation for every data structure 
reference. Pipelining allows a high rate of 
address generation; effective prediction overcomes 
the pipeline delay and allows the MAP to remain 
ahead of the CP. 


3.5. Control Issues 


The instruction fetcher in the MAP is respon- 
sible for generating instruction requests. The 
instruction fetcher sends the instructions it 
receives from the memory to the instruction prepro- 
cessor. The instruction preprocessor forwards por- 
tions of instructions to the CP with operand 
specifications replaced by buffer tags. MAP 
instructions and operand specifications are placed 
in the Operand-Instruction Buffer (OIB). The 
address generation unit steps through the MAP 
instructions and operand specifications found in 
the OIB, executes MAP instructions, generates 
operand addresses, and forwards each operand 
address to the read or write queue. When the 
memory returns the data associated with addresses 
in the read queue, that data is sent to a FIFO 
buffer in the CP. The CP sends data to the MAP for 
the write queue. Addresses in the write queue 
which have received their associated data are ser- 
viced by the memory. 


The CP has an instruction buffer to hold the 
instructions it receives from the MAP. An execu- 
tion unit in the CP steps through the instruction 


buffer, executing instructions one by one. If an 
instruction needs a data item from memory, that 
data is found at the head of the FIFO buffer. If 
the buffer is empty, execution is suspended until a 
data item is received from the MAP. Along with 
each data item, the CP receives an additional bit 


from the MAP which is used as an end-of-data sig- 
nal. Assertion of the end-of-data signal in loop 
mode indicates that execution of the current 


instruction loop is to terminate and that the CP 
should begin execution of another block found in 
its instruction buffer, or wait until a new 
instruction block arrives from the MAP. The CP 
generates write data and signals the MAP regarding 
success or failure of CP tests for data-dependent 
branches performed in the MAP. 


A program begins execution by having the moni- 
tor or operating system jump to the beginning of 
the program; that is, the operating system sets the 
program counter (PC) to the starting address of the 
program. In the SMA machine, the PC is located in 
the instruction fetcher of the MAP. When the PC is 
set to the beginning of the program, the instruc- 
tion fetcher generates requests for instructions 
from the memory. Instruction requests are gen- 
erated until the end of a block is encountered. If 


the instruction at the end of a block is a branch 
instruction, the instruction fetcher suspends 
operation until the branch is resolved. An end-of-block
bit, attached to each instruction, indicates the last
instruction of each block. The starting address of each
block, as found in the PC, is saved to be used later when
checking whether a block loops upon itself or branches to
some other block in the OIB (and in the CP instruction
buffer).


With the information stored in the OIB, the
address generation unit can generate all the data
requests required by a program. As the OIB is
loaded, the address generator can begin executing
MAP instructions and generating operand addresses
by stepping through the entries of the OIB with its
own internal program counter.


The addresses in the read and write queues are 
kept in the order that they were generated so that 
the CP receives read operands in the expected order 
and the MAP receives write data in the proper 
order. If write data is soon read back from 
memory, it is possible that the address of that 
data item will appear in both queues at the same 
time. Each time a read address is placed on the
read queue, the write queue must be checked for an 
outstanding write to that address in order to 
prevent the reading of invalid data. If a match 
occurs, the read must not be permitted to occur 
before the write; otherwise, reads have priority 
over writes. 
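
The queue discipline just described can be modeled with a small sketch. It is a toy model under assumptions that the text does not spell out: reads are serviced strictly in order, a write becomes eligible only once its data has arrived from the CP, and each read records how many earlier writes to the same address are still outstanding. The class and method names are hypothetical.

```python
from collections import deque


class AccessQueues:
    """Toy model of the MAP read and write address queues (names are illustrative)."""

    def __init__(self):
        self.reads = deque()   # entries: [address, earlier writes to it still pending]
        self.writes = deque()  # entries: [address, data or None until the CP sends it]

    def enqueue_write(self, address):
        self.writes.append([address, None])

    def supply_write_data(self, data):
        # Write data arrives from the CP in order, so it fills the oldest empty slot.
        for entry in self.writes:
            if entry[1] is None:
                entry[1] = data
                return
        raise RuntimeError("no write is waiting for data")

    def enqueue_read(self, address):
        # Check the write queue for outstanding writes to the same address.
        pending = sum(1 for write_address, _ in self.writes if write_address == address)
        self.reads.append([address, pending])

    def next_request(self):
        """Return the next memory request, or None if nothing can be serviced yet."""
        # Reads have priority unless the oldest read must wait for an earlier write.
        if self.reads and self.reads[0][1] == 0:
            return ("read", self.reads.popleft()[0])
        if self.writes and self.writes[0][1] is not None:
            address, data = self.writes.popleft()
            for entry in self.reads:
                if entry[0] == address and entry[1] > 0:
                    entry[1] -= 1
            return ("write", address, data)
        return None
```

In this model a read that matches an outstanding write simply stalls at the head of the read queue until the write has been serviced; any other read goes ahead of waiting writes.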


3.6. Branching 


When the instruction fetcher of the MAP 
reaches the end of a block and the instruction is a 
branch, the instruction fetcher suspends further 
sequential instruction requests. If the branch 
depends on a condition in the CP, a signal must be 
received from the CP before the instruction fetcher 
and the address generator can resume operation. 
This signal indicates the success or failure of the
branch. If the branch, however, depends on the
value of an index in the index stack, the branch is
resolved in the MAP. Thus, if the result of an
index-dependent branch requires executing a new
block of instructions, the instruction fetcher can
begin fetching the instructions of the new block
while the CP is performing calculations on the data
for a previous block. The address generator can
even begin making data requests for the new block
while the previous block is still executing in the
CP.


At any one time, the CP's instruction buffer 
may contain the CP instructions for more than one 
instruction block. The OIB in the MAP must, at the 
same time, be capable of holding the accessing 
information and MAP instructions corresponding to 
the instruction blocks in the CP buffer. The CP's 
instruction buffer and the MAP's OIB, while they 
hold information for the same number of blocks, are 
not necessarily the same size since corresponding 
CP and MAP blocks themselves differ in size. Moni- 
toring the amount of information held by both the 
buffers is the responsibility of the instruction 
preprocessor since the instruction preprocessor 
fills the OIB and forwards CP instructions to the 
CP. 


When a branch is resolved, there is a chance
that the target block of the branch is already 
resident in the OIB and the CP's instruction 
buffer. The address generator checks for this 
situation by comparing the branch target address 
against the saved first address of each block 
currently found in the OIB. If there is a match, 
the address generator can immediately begin genera- 
tion of data addresses for the new block. If, on 
the other hand, the information for the block is 
not in the OIB, the instruction fetcher is signaled 
by the address generator that a new block must be 
fetched. In such a case, the address generator 
must wait until new MAP instructions arrive in the 
OIB. When a branch is resolved, the MAP must sig- 
nal the CP which one of the following three branch 
options the CP should take: (1) continue repeated 
execution of the currently executing block, (2) 
execute some other block found in the CP's instruc- 
tion buffer, or (3) expect to receive a new 
instruction block from the MAP. When an entire 
block does not fit in the instruction buffer, it 
may be streamed through, but loop mode is not pos- 
sible. 
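
The decision made when a branch is resolved can be summarized in a short sketch. It assumes, for illustration only, that every block fits in the buffers and that the saved block starting addresses are available as a simple collection; the names `BranchOption` and `resolve_branch` are hypothetical.

```python
from enum import Enum


class BranchOption(Enum):
    REPEAT_CURRENT = 1        # option (1): keep re-executing the current block
    OTHER_RESIDENT_BLOCK = 2  # option (2): switch to a block already in the buffers
    EXPECT_NEW_BLOCK = 3      # option (3): the block must be fetched from memory


def resolve_branch(target, current_start, resident_starts):
    """Pick the option signaled to the CP by comparing the branch target
    against the saved starting address of each resident block."""
    if target == current_start:
        return BranchOption.REPEAT_CURRENT
    if target in resident_starts:
        return BranchOption.OTHER_RESIDENT_BLOCK
    return BranchOption.EXPECT_NEW_BLOCK
```

Only the third outcome requires the address generator to wait for the instruction fetcher; in the first two cases data address generation for the target block can begin immediately.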


Normally, the CP is in a loop mode type of 
operation and expects a stream of data from the 
MAP. That is, if the end of the currently execut- 
ing block is not a branch which depends on data in 
the CP, the execution unit of the CP will re- 
execute the currently executing block as long as 
the CP receives data from the MAP and the end-of-
data signal is not set. This mode of operation is
especially well-suited for executing an instruction 
block which operates on an array. Since the number 
of times such a loop is executed depends on the 
size of the array and the value of indices in the 
IS, branches will occur in the MAP based on values 
in the IS. The only effect these branches have on 
the CP is that data continues to be supplied to the 
CP until the loop terminates. 


If the MAP determines that branch options 2 or 
3 are to be followed, an active end-of-data flag is
sent to the CP on the read data queue after the
last data item associated with the currently exe- 
cuting block. The value of the data word sent with 
an active end-of-data signal informs the CP whether 
option 2 or option 3 is followed. One data value 
is reserved to indicate that the CP should expect a 
new block from the MAP (option 3). Any other data 
value is a pointer to a block in the CP's instruc- 
tion buffer (option 2). Thus, program execution in 
the CP is controlled through the read data stream 
and the CP checks for the end-of-data signal on 
each read data queue access. 
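
How the CP might interpret the read data stream is sketched below. The sketch makes simplifying assumptions that are not in the text: each loop iteration consumes exactly one operand, the reserved "expect a new block" value is arbitrarily chosen as -1 (`NEW_BLOCK`), and the simulation simply stops when a new block is awaited.

```python
NEW_BLOCK = -1  # hypothetical reserved data value meaning "expect a new block" (option 3)


def cp_trace(stream, current_block):
    """Follow a stream of (value, end_of_data) pairs and record what the CP executes."""
    trace = []
    for value, end_of_data in stream:
        if not end_of_data:
            # Loop mode: re-execute the current block on the next operand.
            trace.append((current_block, value))
        elif value == NEW_BLOCK:
            # Option 3: wait for a new instruction block from the MAP.
            trace.append(("await-new-block", None))
            break
        else:
            # Option 2: the data word points to another resident block.
            current_block = value
    return trace
```

For example, a stream of two operands, an end-of-data word pointing to block 2, one more operand, and the reserved value yields two iterations of the starting block, one iteration of block 2, and then a wait.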


When the CP performs the test for a data- 
dependent branch, the MAP ceases prefetching data 
until the branch is resolved. This wait time 


incurred by the MAP is undesirable when such a test 
is executed frequently and a particular outcome is 
expected. Instead, the wait time could be used to 
prefetch the data for the likely branch target. 
The end-of-data signal provides a convenient way of 
disposing of data wrongly prefetched by the MAP. A 
reserved data value, sent with the end-of-data sig- 
nal, could signal the CP to purge all buffered and 
incoming data until the next end-of-data signal. 
Such a reserved value would be written by the MAP 
into its read buffer whenever the MAP continued 
prefetching data and received a wrong-way branch 
indication from the CP. This signaling capability 
would be allowed only by special CP branch instruc- 
tions whose opcodes would instruct the CP to purge 
data upon a wrong-way branch. All data in the CP 
read buffer is then purged up to the "purge" end- 
of-data signal and all following data is purged up 
to the next end-of-data signal. Prefetching 
instructions in such a case has no purge problem 
since the next end-of-data signal after the "purge" 
end-of-data signal indicates which instruction 
block to execute next. 


The methods for communication between the MAP 
and the CP are designed to limit the number of 
interruptions in execution due to branching. 
Branches which depend on data in the MAP may occur 
many times without interrupting the operation of 
the CP; therefore, once the CP has a block of 
instructions in its buffer, the MAP can keep a 
stream of data flowing into the CP. 


3.7. Subroutine Calls


The SMA uses a control stack for handling sub- 
routine calls. A stack pointer, frame pointer, and 
an argument pointer are used as in the Digital 
Equipment VAX system. These pointers are main- 
tained in the MAP, and MAP instructions are pro- 
vided to access the pointers and to push and pop 
the SP. 


4. SMA EVALUATION 


The effectiveness of the SMA machine in reduc- 
ing addressing overhead has been evaluated by com- 
paring an SMA machine's performance to that of a 
VAX-like machine, primarily with respect to the 
execution of a Gaussian elimination algorithm 
(GAUSS). Some other evaluations are mentioned 
briefly. GAUSS, written in FORTRAN, is taken from 
[SSPP68]. From the high-level program source, the 
program is compiled into assembly language for a 
VAX running the UNIX operating system and for the 


example SMA machine. To compile the program into 
SMA assembly language, the VAX assembly listing is
modified only with respect to the way data
referencing occurs. That is, when a matrix is
being accessed, SMA instructions are added to set up
the indices for the matrix and to increment these


indices. These SMA instructions, however, elim- 
inate the need for some of the variables used and 
calculations performed by the VAX. Care is taken 


not to give either machine any special advantages. 
Thus, the code produced for the SMA by this 
transformation of VAX machine code is not hand 
optimized to any extent. 


4.1. Number of Memory References Generated 


A program's instruction blocks can be identi- 
fied from the high-level source. Figure 3 is a 
diagram of the control flow for GAUSS in terms of 
instruction blocks. For GAUSS, only two of the 
branches are probabilistic in the sense that they 
are truly data dependent. Each of the other
branches in the program is determined by the value
of an index. These and the unconditional branches 
are handled very well by the MAP of the SMA 
machine. 

The results of a static analysis of GAUSS are 
shown in Table 1. In the SMA version, GAUSS 
requires fewer than half the instructions needed in
the VAX version. When counting the SMA instructions,
MAP instructions are also included in the total number
of instructions. The difference in
the number of data references is as dramatic as the
difference in the number of instructions. Since
the VAX and the SMA versions of GAUSS make the same
number of data structure references, the difference
in data referencing is due to the scalar references.
The SMA programs have fewer distinct
scalars than the VAX programs due to overhead
reduction; thus, the VAX program not only has more
scalars but also performs overhead instructions to
operate on these scalars.


Table 1. Statistics from a static analysis of
GAUSS, EIGEN, and QSORT.

[Only one VAX/SMA column pair is legible; the other
columns and the index row values are not recoverable.]

Number of                    VAX    SMA

instruction blocks            14     18
distinct scalars               8      8
distinct data structures       1      1
access patterns                1      1
instructions                  68     59
data references               61     62
   scalar                     54     53
   data structure              7      7
   index


The static differences between the VAX and the 
SMA versions of GAUSS translate directly into sub- 
stantial differences in the dynamic count of the 
number of memory references for each program. To 
obtain this dynamic count for GAUSS, the number of 
memory references generated by each block is calcu- 
lated as a function of n, the matrix size. For 
data dependent branches, successors are chosen to 
produce a path with the largest number of instruc- 
tions and data references. Thus in GAUSS, pivoting 
is always done and a singular matrix is not encoun- 


tered. Therefore, this is a worst case dynamic 
memory reference analysis. 


In this analysis, it is also desirable to see 
what effects loop mode has on the number of data 
references. Thus for each machine there are two 
cases: one with loop mode and one without loop
mode. For GAUSS, blocks (n, it, j'), blocks 
(qa, k', r), and blocks (c, d, 1) are considered 
inner loops for loop mode execution. To hold each 
set of these blocks in a loop buffer, a hypotheti- 
cal VAX with loop mode added would need to provide 
a buffer of 24 instructions, while the SMA needs a 
buffer of only 8 instructions. 


Figure 4 shows a plot of the dynamic count of 
the total number of memory references required by 
the GAUSS program for a VAX and an SMA machine with 
and without loop mode as a function of n. The SMA 
machine always makes fewer memory references than
the VAX, even if the VAX has a loop mode. The 
number of memory references needed by an SMA 
machine running the GAUSS program on a 100 x 100 
matrix is only 20% of the number of memory refer- 
ences made by the VAX without loop mode. Thus the 
SMA with slow memory can easily outperform a VAX 
with a 1 clock cache cycle. 


Figure 4. Log (base 10) of the number of memory references
for GAUSS for an n x n matrix, as a function of matrix
size n. Curves: VAX no loop mode; VAX loop buffer = 24;
SMA no loop mode; SMA loop buffer = 8.


A similar analysis was performed on an 
eigenvalue-finding algorithm (EIGEN), and a quicksort
algorithm (QSORT). EIGEN is written in FORTRAN
and is the HQR routine from the EISPACK subroutine
package [Smit74]. QSORT is a recursive
program written in PASCAL and based on an algorithm 
from Horowitz and Sahni [Horo76]. 


In a static count, the SMA version of EIGEN 
requires fewer than half the instructions and only 
approximately 75% of the data references of the VAX 
version. A dynamic analysis indicates the SMA 
machine without loop mode generates approximately 
the same number of memory references as the VAX 
with loop mode. For the EIGEN program operating on 
a 100 x 100 matrix, the SMA with and without loop 
mode makes only 30.5% and 47.2%, respectively, of 
the references made by the VAX without loop mode. 
In this analysis an "instruction" represents an 
opcode or a memory reference. Thus a VAX instruc- 
tion with two memory reference operands would count 
as three of these simple instructions. 


A static analysis of QSORT reveals little 
difference between the VAX and SMA versions. 
Nevertheless, the SMA machine reduces dynamic 


memory references for QSORT approximately as much 
as for EIGEN. 


4.2. An Estimate of Relative Performance 


A program's execution time can be partitioned 
into 1) time spent accessing memory and 2) time 
spent computing. The execution time is reduced by 
overlapping these two quantities. Generally, it is 
not possible to overlap all of the computation time 
with memory referencing activity. We call unover- 
lapped computation time the computational overhead. 
By allocating a portion of this computational over- 
head to each memory reference, the execution time 
of a program can be expressed as: 


    T = M (1/v + c)


where M is the number of memory references, c is the
computational overhead, and the term 1/v is the
effective amount of time needed per memory access.
The variable v is the memory bandwidth and is 


included as a parameter so that comparisons can be 
made between machines whose memory speeds differ. 
A larger v represents a faster memory and, there- 
fore, a reduced memory access time. If the memory 
is interleaved, v takes the interleaving factor 


into account. The same algorithm executed on dif- 
ferent machines will yield a different execution 
time because the term M will vary from machine to


machine, as will the term c. 


The computational overhead, c, is difficult to 
measure. It varies from one program to another and 
also from one machine to another. Due to machine 
dependencies, different models of even the same 
machine will have different values of c. The value 
of c is also a function of the memory bandwidth v.
As memory access time decreases (increasing v), 
less computation time can be overlapped with memory 
accesses, causing c to increase. Due to these 
dependencies, c is treated as a parameter in our 
comparison of performance. 


The performance of a machine is given by the 
inverse of the execution time. We calculated the 
performance of conventional machines and an SMA 
machine for c ranging from 0 to 2 and for v taking 
on values of 1, 2, 4, and 8. The computational 
overhead is in units of standard memory cycle times 
per memory reference, as is the term 1/v. The fac- 
tor M is taken from the dynamic analysis of GAUSS 
run on a 100 x 100 matrix. To aid in comparing one 
machine with another, performance is normalized to 
the performance of a conventional machine with no 


computational overhead (c=0) and a memory bandwidth 
of one (v=1). 
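
The normalization just described amounts to a few lines of arithmetic. The sketch below evaluates T = M (1/v + c) and divides the base machine's time by it. The function names are hypothetical, M is expressed relative to the base machine's reference count, and the value 0.2 in the usage note is only the rounded 20% figure quoted earlier for GAUSS, used here as an illustrative input rather than a measured one.

```python
def execution_time(m, v, c):
    """T = M (1/v + c), in standard memory cycle times."""
    return m * (1.0 / v + c)


def normalized_performance(m_relative, v, c):
    """Performance relative to a conventional machine with M = 1, v = 1, c = 0.

    m_relative is the machine's memory reference count divided by the
    base machine's count (an assumption of this sketch).
    """
    base_time = execution_time(1.0, v=1, c=0)
    return base_time / execution_time(m_relative, v, c)
```

With these definitions, `normalized_performance(1.0, 1, 1)` gives 0.5 (the same machine with c=1 takes twice as long), and `normalized_performance(0.2, 8, 0)` gives about 40 for a machine that makes one fifth of the references with eight times the bandwidth.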


The normalized performance for GAUSS is shown
in figure 5. Machines with and without loop mode
are treated separately because the presence of loop
mode affects the number of memory accesses required
by the program. Once loop mode is established,
memory requests for the instructions of the loop
are not needed.


Figure 5. Normalized performance for GAUSS (v =
memory bandwidth, c = computational
overhead, T = run time)


Each vertical line of the graph represents the 
relative performance of a machine with a particular 


memory bandwidth and with the computational over- 
head ranging from 0 to 2. A conventional machine 
with no loop mode, v=1, and c=1 would require 


approximately twice as much time to run a Gaussian 
elimination program on a 100 x 100 matrix as the
machine with c=0. At the other extreme, a c=0 SMA
machine with loop mode and a memory bandwidth of 8
would perform approximately 42 times better than
the base machine: conventional, no loop, v=1, and
c=0.


The performance of the base machine was also 
compared to an SMA machine running EIGEN and QSORT. 
In this comparison, there is a great improvement in 
performance when an SMA machine is used; however, 
the improvement is not as dramatic as for GAUSS. 


For all of the three programs and a given 
memory bandwidth, a conventional machine with loop 
mode and an SMA machine without loop mode perform
almost equally. Furthermore, performance is sensi-
tive to changes in computational overhead, espe-
cially when c varies from 0 to 1. Different
machines should not simply be compared with the 
same value for c. 


5. CONCLUSIONS 


5.1. Summary of Results 


Due to the von Neumann bottleneck, inefficien- 
cies exist in the way address generation is per- 
formed in most conventional machines. The research 


presented here has studied the access process to 
discover where the access inefficiencies lie and 
how they can be reduced. 


From a detailed analysis of program address 
traces [Ples82] we determined the types of features 
that should be included in a machine designed to 
generate memory references efficiently. The pro- 
posed Structured Memory Access (SMA) machine con- 
tains these features. The SMA machine is divided 
into a computation processor (CP) and a memory
access processor (MAP). As their names imply, the 
CP is responsible for the desired computations of 
the system, while the MAP generates all the memory 
references for a program. The SMA machine reduces 
addressing overhead by providing special access 
mechanisms in the MAP to generate references effi- 
ciently for blocks of instructions and several data 
types. The storing of bounds information permits 
bounds checking to occur automatically in hardware 
when data structures are accessed. Because of the 
system's organization, the CP and MAP can operate 
relatively independently of one another. In par- 
ticular, prefetching of instructions and data, and 
index-dependent loop control are inherent features 
of the SMA machine. 


The operation of the MAP and its interactions
with the CP were discussed as were the types of
access mechanisms which reside in the MAP. The
machine's ability to reduce addressing overhead was
then evaluated. A comparison was made between a
hypothetical SMA machine and a VAX-like machine
with respect to the number of memory references
generated by a set of programs. Depending on the
program, the SMA machine reduces the number of
memory references to between 1/5 and 2/5 of those
required by a conventional VAX.


The performance of the SMA machine was then
evaluated. A machine's performance was parameter- 
ized by the memory bandwidth and the computational 


overhead. It was found that performance is very 
sensitive to these parameters; however, an SMA 
machine performs significantly better than a con- 


ventional machine with the same parameters. Now 
that the SMA concept has been justified, more 
detailed performance evaluation and design modifi- 


cations are being carried out in our continuing 
research.


Abstract -- This report describes
SIMAC, a new SImple Multiple Alu Com- 
puter for exploiting low level parallel- 


ism. The main feature of SIMAC is that 
all scheduling of operations to be exe- 
cuted in parallel is done at compile 
time. This results in much simpler 
hardware. Performance is increased 
because the overhead of scheduling is 
done only once and not repeated each 


time the program is executed. Preliminary
results indicate that significant speedups
are possible on both conventional sequential
programs and on numerical array processing
problems.


Introduction 


Given a sequential program for a 
particular algorithm on a computer, 
there are three general approaches that
can be taken to speed up the execution 
of the program. The execution speed of 
individual operations can be increased 
by using faster hardware, the program 
can be broken into two or more indepen- 
dent tasks which are run simultaneously 
on more than one processor (high level 
parallelism), or more than one simple
operation can be performed simultane-
ously (low level parallelism).


In recent years there have been
several studies investigating how much 
low level parallelism exists in algo- 
rithms expressed as computer programs. 
The results of the studies have varied 
considerably. Tjaden and Flynn [14]
among others have measured the parallel-
ism available within basic blocks. (A
basic block is a sequence of instruc-
tions without any conditional jumps.)
These studies have found a speedup of 2 
to 3, where speedup is defined to be the 
sequential execution time divided by the 
parallel execution time. In a more 
recent study Nicolau and Fisher [9] 
report that by exploiting global paral- 
lelism, speedups approaching 18088 are 
possible, and factors of 10 or more were
found in most programs tested. The
speedups reported by Fisher assumed 
unlimited hardware, but they still point 
out that there is a substantial amount 
of parallelism in typical programs that
is currently not being exploited. 


Pipelining is one method that is 
used widely today to exploit a small 
amount of the potential low level paral-
lelism within a program. Except on
large machines like the CDC 6600 or CRAY
I, or special purpose vector processors
like the IBM 3838, pipelined computers
only exploit parallelism by overlapping
the separate phases of the instruction
fetch-decode-execute cycle. The pipe- 
lines of these large or special purpose 
machines are capable of pipelining the 
execution of various numerical computa- 
tions. In general, pipelining is only 
used for the overlap of repeated execu- 
tions of the same operation on different 
data. There is much more low level 
parallelism that is not handled by trad- 
itional pipelining techniques, and is 
only touched on slightly by some very 
large machines. SIMAC allows exploita-
tion of more of this low level parallel-
ism on a simple computer.


Another common form of low level 
parallelism found today is the parallel- 
ism found in horizontally micropro- 
grammed computers. Unfortunately this 
cannot be easily used since it is in
general below the level accessible by
the user. Some work is being done in
compiling high level languages directly
to microcode [12] and in migrating common
high level operations into micro-
code [6,10] to attain speed improvements.

Data flow computers are another 
attempt to exploit low level parallel- 
ism. In general, data flow computers 
require very complex hardware to control 
the parallel operation of the multiple 
processing elements [15]. This is an
active area of research. 


There is a group of researchers 
today advocating simple or reduced
instruction set computers[11,5]. This 


is in contrast to the more traditional 
trend toward more complex instruction 
set computers as can be seen in the DEC
VAX-11 or Intel's iAPX 432. The develop-
ment time for a new computer is directly
proportional to the complexity of the 
architecture. In order to make maximum 
use of today's rapidly changing hardware 
technology, it is important to keep the 
development time short. RISC architec- 
tures are simpler and can therefore be 
implemented very rapidly using the 
latest in hardware technology. Another 
important factor in favor of RISC archi- 
tectures is the large difference in com- 


munication speeds between VLSI chips 
versus the speeds within VLSI chips. 
With a RISC architecture it is possible 


to put one or more entire processors on 
a single chip, greatly reducing the
amount of communication off of the chip. 


Multiprocessor systems are being 
developed that exploit high level paral-
lelism by running separate tasks in
parallel, each on a different uniproces- 
sor. This still leaves open the oppor- 
tunity for exploiting more low level 
parallelism in the uniprocessors that
make up the multiprocessor systems. 


SIMAC is a computer system which 
uses simple hardware and sophisticated 
compiler techniques to exploit the low 
level parallelism which exists in a wide 
range of programs. SIMAC is a single
instruction stream multiple data stream
computer. The processing elements (PEs)
are identical and communicate through a
shared memory and a set of shared regis-
ters. Each PE contains a simple ALU plus
logic to issue reads and writes to
memory. The PEs are controlled by a
master control processor (CP) which
fetches instructions from memory and
issues instructions to the PEs. SIMAC
differs from existing machines with low
level parallelism (e.g., CDC 6600/7600) in
that all scheduling is done at compile 
time. 


SIMAC Architecture 


Low level operations are statically 


scheduled for each PE by software at 
compile time. These statically
scheduled parallel instructions allow
for parallel execution of low level
operations on each processor. By moving
the scheduling task into the software,
the hardware has been kept simple.


Although hardware costs may be dropping, 
this does not necessarily imply that the
hardware should be more complex, only
that there should be more hardware. By
keeping the architecture of SIMAC sim-
ple, the cost of the processor will be
low and therefore many SIMAC processors
could be combined into a larger mul-
tiprocessing system.


System Architecture 


The SIMAC architecture is indepen-
dent of the number of PEs. The number
of PEs is limited only by the use of the
registers and the bandwidth of the
memory. There is nothing in any of the
instructions that depends on the number
of PEs. This is not the same as the
ability to simply add PEs to the system
at any time. For any particular imple-
mentation the number of PEs will be
fixed. The implementation of the memory
controller, register switch and CP all
depend upon the number of PEs. A block
diagram of the processor is shown in
figure 1.


Figure 1. Block Diagram of SIMAC.
(PE - Processing Element)
(CP - Control Processor)
Labeled blocks: REGISTER ARRAY, REGISTER SWITCH,
MEMORY CONTROLLER, MEMORY.


Each PE in SIMAC is a load/store
RISC [11] style processor. Each PE is


capable of executing a small set of sim- 
ple operations (arithmetic, logical and 
shift, and test). All branching opera- 
tions are processed by the CP. All MOPs 
(machine operations) are encoded in 32 
bits and operate only on registers with 
the exception of load and store. 


SIMAC has only one program counter 


(PC) maintained by the CP, hence a single
instruction stream (istream). However,
each instruction (PI or parallel 


instruction) may consist of up to n (the 
number of PEs) unrelated machine opera- 
tions. Another way of viewing the 


multiple instruction capability is to 
think of this as a microprogrammed 
machine which is capable of accepting 


varying width horizontal instructions. 
The instruction width can be any multi- 
ple of 32 bits up to the maximum size 
determined by the number of PEs. The 
instruction width is controlled by a bit 
in each MOP. 


The PEs have no local registers.
There are 32 general purpose 32-bit
registers that are shared by all PEs.
Each register may be read by any or all
PEs each cycle. Also, each PE may write
one register per cycle. It is the
responsibility of the software to ensure
that there are no resource conflicts
when more than one PE is active.


SIMAC is similar to a variable
width, horizontally microprogrammed
machine. With 1 PE active, SIMAC simply
processes a single 32 bit MOP every
cycle. With n>1 PEs active, SIMAC exe-
cutes a single 32 bit MOP on each of the
n PEs. Each PE gets a different 32 bit
MOP. PE0 gets its MOP from M(PC), PE1
gets its MOP from M(PC+1), PE2 gets its
MOP from M(PC+2) etc., and the PC is
incremented by n. In this way n dis-
tinct parallel istreams are being exe-
cuted with the non-trivial restriction
that all n istreams must branch
together. This is ensured by the
software which only allows one of the n
MOPs for a single PI to contain a branch
type MOP.
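
The way a parallel instruction is gathered from consecutive memory words can be sketched as follows. The text says only that the instruction width is controlled by a bit in each MOP; this sketch assumes, hypothetically, that the bit marks the last MOP of a parallel instruction, and it models memory as a list of (mop, is_last) pairs. The function name is also an assumption.

```python
def fetch_parallel_instruction(memory, pc, num_pes):
    """Gather one parallel instruction starting at M(PC).

    Returns (mops, next_pc), where mops[i] is the MOP issued to PE i
    and next_pc is the PC advanced by the number of MOPs issued.
    """
    mops = []
    while True:
        if len(mops) == num_pes:
            raise ValueError("parallel instruction is wider than the number of PEs")
        mop, is_last = memory[pc + len(mops)]  # PE i takes its MOP from M(PC + i)
        mops.append(mop)
        if is_last:  # assumed meaning of the width-control bit
            return mops, pc + len(mops)
```

With a memory image of three MOPs followed by a one-MOP instruction, the first fetch issues to PE0 through PE2 and advances the PC by 3, and the second issues a single MOP to PE0.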


Each PE in SIMAC has two bits, T
(test) and V (valid) that are used for
branching. Whenever a PE executes a
test instruction (e.g., tlss, test less
than, and teq, test equal), the V bit in
that PE is set. In addition the T bit
is set if the test is true and cleared
if false. The actual conditional branch
is then executed by the CP when it
encounters a branch instruction. There
are two types of conditional branch
instructions, BAND (branch anding) and
BOR (branch oring). BAND and BOR
instructions clear all V bits whether or
not the branch is taken. A BAND branch
is taken if the logical "and" of all
test instructions executed since the
last branch is true. A BOR branch is
taken if the logical "or" of all test
instructions executed since the last
branch is true. More formally a BAND
branch is taken if the following is
true:

(T1 or ~V1) and (T2 or ~V2) and ...


A BOR branch is taken if the following
is true:

(T1 and V1) or (T2 and V2) or ...


This allows for rapid, parallel evaluation of high level branches such
as:

IF ( A > B and C <= D ) then ...

Assuming A, B, C, and D are registers, then the above could be
executed by the following two parallel instructions:

PI1: tgtr A,B / tleq C,D / ...
PI2: band label / ...

By using boolean algebra, tests can be transformed into a sequence of
comparisons "ored" together or a sequence of comparisons "anded"
together.
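The BAND and BOR conditions can be illustrated with a small Python model. This is only a sketch of the two formulas given in the text; the function names, the list-of-pairs representation of the T and V bits, and the assignment of the two tests to PE0 and PE1 are assumptions made for illustration.

```python
from typing import List, Tuple

Bits = Tuple[bool, bool]  # (T, V) for one PE


def band_taken(bits: List[Bits]) -> bool:
    """BAND: (T1 or ~V1) and (T2 or ~V2) and ..."""
    return all(t or not v for t, v in bits)


def bor_taken(bits: List[Bits]) -> bool:
    """BOR: (T1 and V1) or (T2 and V2) or ..."""
    return any(t and v for t, v in bits)


def evaluate(a: int, b: int, c: int, d: int, n_pes: int = 4) -> List[Bits]:
    """Model PI1 of the example: PE0 runs tgtr A,B and PE1 runs tleq C,D."""
    bits: List[Bits] = [(False, False)] * n_pes  # V clear: no test executed
    bits[0] = (a > b, True)    # a test sets V and loads T with its result
    bits[1] = (c <= d, True)   # same on the next PE
    return bits


if __name__ == "__main__":
    bits = evaluate(5, 3, 2, 2)
    print("BAND taken:", band_taken(bits))
    print("BOR taken:", bor_taken(bits))
```

PEs that executed no test leave V clear, so they do not block a BAND branch and cannot trigger a BOR branch.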


Control Processor Architecture 


The Control Processor (CP) contains the program counter and executes
all branching type operations. The CP is also responsible for
requesting the instructions from the memory controller and dispatching
the MOPs to the PEs. In summary, the CP is responsible for the
following tasks:

- Maintain the program counter.

- Perform branch operations using the T and V bits from each PE.

- Fetch instructions from memory.

- Issue non-branch MOPs to the PEs.


Processing Element Architecture


Each PE is an extended ALU capable of doing register to register
arithmetic and simple memory accesses. The instructions are sent to
the PE from the CP.


The instructions are broken into five groups: simple register to
register instructions (e.g. add, shift), complex register to register
instructions (e.g. multiply, divide), load/store instructions, test,
and miscellaneous. The miscellaneous, test, and simple register to
register instructions all execute in one cycle; the number of cycles
for the others will vary.
Bit 31 of each MOP, called the P bit, is used to pack MOPs into
parallel instructions (PIs). All MOPs within a single PI are executed
in parallel. The assignment of MOPs to PEs is determined by the
position in the PI. The leftmost MOP is executed by PE0, the next by
PE1, etc. The last MOP of each PI may be a branch operation which is
executed by the control processor. The P bit of the last MOP in each
PI will be clear and the rest will be set. For example a 128-bit PI
would look something like:

127 --- 95 --- 63 --- 31 --- 0
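The P-bit packing rule can be sketched in Python as follows. The sketch assumes that MOPs are stored as 32-bit integers in memory order (leftmost MOP first); the helper names and the sample opcode values are hypothetical and are not taken from the SIMAC design.

```python
from typing import List, Tuple

P_BIT = 1 << 31  # bit 31 of each 32-bit MOP


def split_into_pis(mops: List[int]) -> List[List[int]]:
    """Group a stream of MOPs into PIs: P set means more MOPs follow."""
    pis: List[List[int]] = []
    current: List[int] = []
    for word in mops:
        current.append(word)
        if not word & P_BIT:      # clear P bit marks the last MOP of a PI
            pis.append(current)
            current = []
    if current:
        raise ValueError("MOP stream ends in the middle of a PI")
    return pis


def dispatch(pi: List[int]) -> List[Tuple[int, int]]:
    """Assign MOPs to PEs by position: first MOP to PE0, next to PE1, ..."""
    return [(pe, mop & ~P_BIT) for pe, mop in enumerate(pi)]


if __name__ == "__main__":
    stream = [P_BIT | 1, P_BIT | 2, P_BIT | 3, 4, 5, P_BIT | 6, 7]
    for pi in split_into_pis(stream):
        print(dispatch(pi))
```

Under these assumptions the PI width is implied entirely by the P bits, so a PI of one MOP and a PI of four MOPs need no separate length field.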


Register switch 

The register switch provides simultaneous read access to all registers
by all PEs and allows more than one PE to reference the same register.
This can be implemented as a large selector switch. For a fixed number
of registers the number of transistors for the switch grows linearly
with the number of PEs. This would require approximately 12,000
transistors per PE. Assuming VLSI chips with 250,000 transistors, the
selector circuit for 4 PEs would utilize less than 20% of the chip
area.


Memory controller 


The memory controller accepts data read/write requests from all PEs
and instruction read requests from the CP. This is a potentially
critical section of the overall system design due to the possibility
that the memory bandwidth will be a bottleneck in the system. For this
study we will assume that this is not a problem, and that with
sufficient hardware a memory system can be built to supply the
necessary bandwidth (e.g. interleaved memory).
Compaction and Scheduling


Code generation for SIMAC can be divided into 3 components (not
necessarily done in 3 separate phases): generation of sequential MOPs,
compaction of MOPs into PIs, and ordering of MOPs within a PI. The
need for the first 2 components should be clear. The third component
is required to allow the execution of two PIs containing different
length MOPs (e.g. addition and multiplication) to overlap using only
simple hardware controls (figure 2).
All MOPs within a single parallel instruction (PI) must begin
execution in the same cycle, but they may end at different times. A
simple mechanism is provided to allow for overlapping the execution of
2 PIs, assuming both PIs do not use all PEs. A new PI will be issued
by the CP as soon as all PEs for which the new PI contains an MOP are
ready. This makes it possible for the software to schedule long MOPs
on high numbered PEs and continue using the lower numbered PEs for
short MOPs. As shown in the following example, by properly scheduling
the MOPs into PEs, the two additions can overlap the execution of the
multiply.


program segment

x = a * b
y = a + b + c


sequential code

R1 <- R2 * R3 ; MOP1
R4 <- R2 + R3 ; MOP2
R4 <- R4 + R5 ; MOP3


compacted code

PI1: R1 <- R2 * R3 / R4 <- R2 + R3
PI2: R4 <- R4 + R5


Figure 2. Scheduling example.
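The issue rule (a new PI is issued as soon as every PE it uses is ready) can be modelled with a short Python sketch. The latencies used here (1 cycle for an add, 4 for a multiply), the limit of one PI issued per cycle, and the dictionary representation of a PI are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

LATENCY = {"add": 1, "mul": 4}  # assumed cycle counts


def issue_times(pis: List[Dict[int, str]]) -> Tuple[List[int], int]:
    """Return the issue cycle of each PI and the cycle when all work is done.

    Each PI maps a PE index to the operation placed on that PE.
    """
    ready: Dict[int, int] = {}   # cycle at which each PE becomes free
    issued: List[int] = []
    next_cycle = 0               # at most one PI is issued per cycle
    for pi in pis:
        start = max([next_cycle] + [ready.get(pe, 0) for pe in pi])
        issued.append(start)
        for pe, op in pi.items():
            ready[pe] = start + LATENCY[op]
        next_cycle = start + 1
    return issued, max(ready.values())


if __name__ == "__main__":
    # Long MOP on the higher-numbered PE: the second add overlaps the multiply.
    print(issue_times([{0: "add", 1: "mul"}, {0: "add"}]))
    # Long MOP on PE0: the second PI has to wait for the multiply.
    print(issue_times([{0: "mul", 1: "add"}, {0: "add"}]))
```

In this model the ordering of MOPs within a PI matters only because a one-MOP PI always lands on PE0, which is why the text recommends keeping the long operations on the high numbered PEs.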


Software scheduling of MOPs into PIs is an instance of the processor
scheduling problem. The problem can be described as follows: given a
set of tasks t1,t2,...,tn and a partial order on those tasks which
specifies that if task ti<tj, then ti must complete before tj may
begin. Each task takes some length of time Ti to be processed and
there are m identical processors, p1,p2,...,pm on which to process
these tasks. A schedule is an assignment of the tasks to discrete time
units such that:


1. No more than m tasks are assigned to any time unit (one for each
   processor).


2. If ti<tj, then ti is assigned to a time unit at least Ti time
   units before tj.


The problem is to find the schedule which completes all tasks in the
shortest number of time units. This problem is NP-complete. The
question then becomes, can a good approximation be found? Fortunately
much work has been done in this area. Various "list scheduling"
algorithms were compared in a report by Adam, Chandy and Dickson[1].
They found that HLFET (highest levels first, estimated times) was the
best of those tested, and in many cases actually achieved the best
known lower bound, that of Fernandez-Bussell[3].
Most cases tested came within 9@.2 per- 
cent of the lower bound. In this method 
tasks are assigned priorities equal to the length of the longest chain
from the task to the end. Tasks are then scheduled into time units
from a list of data ready tasks, with the highest priority data ready
task being scheduled first.
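A minimal Python sketch of this kind of list scheduling is given below. It follows the description above (priority equals the longest chain to the end, highest priority ready task first) but is not the SIMAC compactor: the function name, the data layout, integer task times of at least one unit, and an acyclic precedence graph are all assumptions.

```python
from typing import Dict, List, Tuple


def hlfet(duration: Dict[str, int], succs: Dict[str, List[str]],
          m: int) -> Tuple[Dict[str, int], int]:
    """List-schedule tasks on m processors, highest level first.

    Returns the start time of every task and the total schedule length.
    """
    level: Dict[str, int] = {}

    def lvl(t: str) -> int:
        # Level = own time plus the longest chain through any successor.
        if t not in level:
            level[t] = duration[t] + max(
                (lvl(s) for s in succs.get(t, [])), default=0)
        return level[t]

    for t in duration:
        lvl(t)

    preds_left = {t: 0 for t in duration}
    for ss in succs.values():
        for s in ss:
            preds_left[s] += 1

    ready = [t for t in duration if preds_left[t] == 0]
    running: List[Tuple[int, str]] = []   # (finish time, task)
    start: Dict[str, int] = {}
    free, time = m, 0
    while len(start) < len(duration):
        # Retire finished tasks and release their successors.
        for item in [r for r in running if r[0] <= time]:
            running.remove(item)
            free += 1
            for s in succs.get(item[1], []):
                preds_left[s] -= 1
                if preds_left[s] == 0:
                    ready.append(s)
        ready.sort(key=lambda t: -level[t])   # highest level first
        while free and ready:
            t = ready.pop(0)
            start[t] = time
            running.append((time + duration[t], t))
            free -= 1
        time += 1
    return start, max(start[t] + duration[t] for t in duration)


if __name__ == "__main__":
    # The multiply and two dependent adds, with assumed latencies.
    dur = {"mul": 4, "add1": 1, "add2": 1}
    print(hlfet(dur, {"add1": ["add2"]}, m=2))
```

The sketch steps through time one unit at a time, which keeps it short; a production scheduler would more likely use an event queue.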


Preliminary results 


The portable C compiler[7] has been modified to generate sequential
code for SIMAC. The output of the compiler is then compacted
(scheduled) into parallel instructions. The optimization and
compaction is done in a post pass and is currently 3008 lines of C.
The dynamic performance of the sequential code is then compared with
the parallel code using a simulator. The simulator is currently 1600
lines of C.


Tables 1 and 2 present the results of simulations run on several
benchmarks. All times are reported in terms of speedup relative to the
optimized sequential output of the portable C compiler.


Table 1 lists the speedups achieved using 4 processing elements. For
these tests, it is assumed that memory reads and writes require 2
cycles, multiply and divide require 4 cycles and all other MOPs
require only 1 cycle. The first two columns show speedups for local
compaction and local plus global compaction. Column three shows
speedups assuming that PIs with different length MOPs (figure 2) are
not allowed to overlap. The last column shows the speedups assuming
that all MOPs take only one cycle to execute.


Table 2 lists the speedups achieved using various numbers of
processing elements. For all of the speedups in table 2, both global
and local compaction were performed as well as overlapping PIs with
different length MOPs.

Using 4 Processing Elements.
Program      Local Motion   Global Motion   No PI Overlap   Equal MOPs


quicksort 1.6 1.6 
bubblesort 1.8 2.4 1.9 
testio 1.7 1.7 1.3 : 
factorial 1.9 1.9 1.8 a 
prime 1.3 1.4 1.4 7 
puzzle 1.4 1.7 1.4 a 
search 1.4 1.5 1.4 .B 
acker £0 2.4 2.0 6. 
LLL 2.2 Re 1.4 6 
matrix 2.6 2.6 1.8 0 
2.2 2.4 9 
1.6 


Table 1. Speedup with 4 PEs.


quicksort 
bubblesort 
testio 
factorial 
prime 
puzzle 


search 
acker 
LLL 
matrix 


Table 2. Speedup with different numbers of PEs.


Conclusion 


A new architecture for exploiting low level parallelism has been
presented. The principle of parallel or horizontal control has proven
itself to be very effective in the microprogrammed control of
hardware. SIMAC is an innovative and useful extension of this
principle to a higher level. SIMAC has increased the amount of
parallelism that can be exploited in general-purpose programs using
relatively simple hardware.


SIMAC moves the burden of scheduling from the hardware at execute time
to the software at compile time, which has resulted in greatly
simplified hardware design and reduced execution time when compared to
other machines with low level parallelism.


Due to the limited number of PEs, SIMAC will never be able to fully
exploit the massive parallelism of vector operations, as is being done
by some highly parallel array processors[2,4]. However, SIMAC is
capable of exploiting the parallelism in non-vector calculations,
which has previously been done only on data flow machines and to a
limited degree in the pipelines of some machines[13,8].

Abstract -- Hierarchical micro-architectures have been designed and
developed for a two-level microprogrammed multiprocessor computer,
MUNAP. In the MUNAP, a 28-bit microinstruction simultaneously drives
several nanoprogram streams of 40-bit nanoinstructions in four 16-bit
processor units.

On the basic microinstruction and nanoinstruction sets, the
micro-assembly language level (Level 1) architecture has been defined
to allow the user to describe frequently used micro-nano combinations
and microprogram level SIMD operations in one statement. Experimental
results show that the description of parallel processing is divided
into (i) 33.3 % micro-nano combined statements, (ii) 41.9 % parallel
nanostatements for the SIMD operations, and (iii) 24.8 % different
nanostatements for the MIMD operations.

The system description language level (Level 2) architecture is
defined on the Level 1 primitives. This level has many features common
to other system description languages on such items as data types,
operators, and functions, and some MUNAP dependent features. The
experimental results show that the average processor utilization for
each machine cycle varies from 1.1 to 3.8 within 4 processor units
depending upon the content of processing.


1. Introduction 


A significant trend in computing system design is the implementation
of a computer that can operate on a wide variety of problems with high
efficiency [5, 9]. We have developed a research-oriented
multiprocessor computer, MUNAP (MUlti-NAnoProgram machine), as a
vehicle for solving a wide range of nonnumeric and associated problems
[1, 4]. The design objectives of the MUNAP system are: 1) the system
should support basic functions for a wide range of nonnumeric
processing through firmware and hardware; and 2) as a research
vehicle, the system should provide a powerful, yet flexible,
architecture. In order to attain these objectives, we have designed a
new multiprocessor computer architecture, controlled via a two-level
microprogramming scheme. Architecture comparisons of MUNAP with other
possible architectures, such as single-level versus two-level
controls, and multiprocessor versus uniprocessor, verified the
advantages of MUNAP for such parameters as control storage
requirements, execution steps, and multiprocessor unit parallelism
[4].


The key for utilizing such an innovative 
machine efficiently is to provide the users with 
good architectural views [9]. The difficulty is 
that we must define an architecture which not only 
aids the programming process but also utilizes the 
basic hardware features, such as parallelism
among multiple processors, a two-level micropro- 
grammed control, and nonnumeric processing units. 
The higher we define the architecture, the more 
difficult it is for the user to use the hardware 
features efficiently. The lower we define it, the 
more difficult the programming process is because 
of its innovative hardware organization. In ordi- 
nary machines, the machine language is defined as 
an interface between software and firmware (or 
hardware). This makes it difficult to utilize the 
micro-architecture level parallelism, because the 
level of machine language architecture is too low 
and too rigid to utilize such parallelism. To 
solve the problem, we have designed and developed 
hierarchical micro-architectures; the micro- 
assembly language level architecture [2] and the 
system description language level architecture. 
In the former architecture, the user can describe 
concurrent operations in one statement. The
operations include the micro-nano combined opera- 
tion and the micro-level SIMD operation. In the 
latter architecture, the macro functions are 
directly implemented in the micro-assembly 
language to provide a relatively high-level archi- 
tecture that implicitly contains parallelism of 
multiple processors and nonnumeric processing
functions. These hierarchical architectures have 
been provided in MUNAP for users who have various 
system requirements. Several experiments have 
been made to clarify the effect of the hierarchi- 
cal micro-architectures. 


In this paper, we will describe each archi- 
tecture on such items as design objectives, 
language features, processing, and experimental 
results. Based on the results, we consider the 
effect of multiple micro-architectures for providing
the user "easy to use" and "efficient" interfaces,
based on a multiprocessor computer with
innovative hardware organization. The experimen- 
tal results also show how many processors are 
active on average when we apply multiple proces- 
sors to ordinary (not parallel) problems. 


2. Basic Concepts of Hierarchical Micro-architectures

2.1 Hierarchical micro-architectures

At the initial stage of hardware development for MUNAP, we were
required to do hand-coding for checking the hardware by running test
microprograms. The experience taught us that it is very difficult to
describe a microprogram with multiple nanoprograms. The nanoprogram
level MIMD feature makes the process more complicated. At times, we
had to write a 28-bit microinstruction with four 40-bit (i.e.,
160-bit) nanoinstructions to specify the control for one machine
cycle. Based on the experience, we decided to develop an architecture
that is not only easy to use but also efficient. To satisfy these
contradictory requirements, the hierarchical micro-architectures have
been developed.


Figure 1 shows the basic idea of hierarchical micro-architectures. At
the lowest level, the multiple nanoinstruction sets are defined for
multiple processors. The meaning of a microinstruction is partially
determined by the nanoprograms. The microinstruction set is defined on
the nano-level architecture. This micro-level instruction set,
combined with the nanoinstruction sets, represents the visible
micro-architecture of the machine (Level 0). On the Level 0
architecture, we define a higher level, enhanced view of the machine
to make it easier to develop microprograms by utilizing symbolic
expressions for arithmetic operations and sequencing, and providing
the user a facility for describing the combined micro-nano operations
in one statement (Level 1). The key for defining the Level 1
architecture is that it should not lose the flexibility for specifying
micro-nano combinations and the parallelism among multiple processors.
When we met a decision point in designing the language, that is,
easy-to-use or efficient, we chose efficiency or prepared two types of
expressions for the same operation, one for efficiency and the other
for ease of use (examples are in Section 3.2). At the next level, a
higher view is defined as a system description language level (Level
2). This level provides problem solving capabilities by including
several data types, operators, and functions, as in usual system
description languages [6]. Further, the tagged architecture is defined
to aid the user's debugging process and to realize dynamic data type
transformation. An additional layer to be considered is derived from
the varieties of user specialties that may surround the system (Level
3). By using the Level 2 facilities, we can describe a high level
language processor, such as the PASCAL compiler for Level 3.


Level 3: High-Level Language
Level 2: System Description Language
Level 1: Micro-assembly Language
Level 0: Microinstruction Set

Fig. 1 Basic concepts of hierarchical micro-architectures.

Fig. 2 MUNAP hardware organization.


2.2 Hardware organization 


The basic hardware organization of MUNAP that supports the above
architecture is outlined here along with the basic micro-nano
interaction mechanism, and some software and hardware support systems
developed on the console processor of MUNAP. Figure 2 outlines the
data flow of MUNAP. There are one microprogram memory (MPM) and four
nanoprogram memories (NPM) within four identically constructed
processor units (PU). A 28-bit microinstruction (μI) simultaneously
drives several nanoprogram streams in the four 16-bit processor units.
A 40-bit nanoinstruction (nI) has a 1-bit field to specify the end of
a nanoprogram at each execution step. When all the nanoprograms end,
the next microinstruction is activated. Since the nanoprogram memory,
like the microprogram memory, is reloadable, the user can specify any
combination of nanoinstructions in multiprocessor units.


Several hardware units are distributed among the microprogram level
and nanoprogram level. The microprogram-controlled units are the four
levels of 16-number, 4-bit segment shuffle exchange network (SEN) with
exchange and broadcast cells for interconnection between processors
and main memories and for data permutation [7, 8], and the 8 banks of
main memory modules (MM) with address modifier (AM) for variable
length word access and two dimensional table access. The
nanoprogram-controlled units are the arithmetic and logic unit (ALU),
the bit operation unit (BOU) for bit count, bit test and priority
encode, and the field division and concatenation unit (DCU). The other
units include the micro stacks (MSTK) and general registers (REG) at
the micro level, and the register file (RF), scratchpad memory (SPM),
counter (C), flag register (FLR), and port registers (IPR and OPR) at
the multi-nano level.


The ECLIPSE S/130 minicomputer is attached to MUNAP as a console
processor. It allows the user to read/write the content of MUNAP
facilities and start/stop the microprograms. The loader of two-level
microprograms and the evaluator for getting the run-time data are also
available on the ECLIPSE. MUNAP has been designed and constructed with
about 2500 ICs at our laboratory. For the details of the MUNAP
organization, see [4].

In the following sections, we will concentrate our discussion on the
design objectives and the language processing for Levels 1 and 2. The
experimental results are also shown to verify the effectiveness of the
design.

3.1 Design objectives


In order to obtain an efficient microprogram, allowing the user to
utilize the hardware features, the register-transfer-level language
has been designed and implemented [3]. (Figure 5(b), shown later,
gives a sample description.) The language features are summarized in
the following two types:


(1) Description of two-level microprograms: Basically, the user can
describe any combination of a microinstruction and multinanoprogram,
activated by the microinstruction, in order to efficiently make use of
the flexibility of two-level microprograms. The microprogram and
nanoprograms are described in a sequence and distinguished by
indentation. At this level, the hardware functions are uniform and
frequently used functions may be described in one statement without
being aware of the operations of elementary micro- and
nano-instructions. The following Examples 1 and 2 illustrate the
uniformity and one statement description, respectively.


(Example 1) *IF POS2 = 1 THEN GOTO LO;

At the Level 0, there is no flag that indicates the data is positive.
This is to avoid the redundancy of hardware that has test functions
for zero and negative data. However, this causes inconvenience for the
user. The virtual flag, named POS, is defined at the Level 1 to make
the test functions uniform.

(Example 2) SPM 0,1(5) := RF 2,3(6);

The contents of register file RF(6) are read in parallel from
processor units 2 and 3, and sent to SPM in processors 0 and 1 through
the shuffle exchange network (SEN). This statement is decomposed into
1 microinstruction and 4 nanoinstructions as described later.


(2) Description of parallel processing: The parallel processing of
MUNAP is divided into three categories: (i) microinstruction and
multiple nanoinstructions are tightly coupled to perform a single
task; (ii) SIMD operation, in which multiple PUs do the same
operation; and (iii) MIMD operation. Example 2 is an example of (i).
Although the MIMD operation should be described in several statements,
the SIMD operation may be described in one statement. Examples 3 and 4
show the examples for MIMD and SIMD operations, respectively.


(Example 3) SPMO(A) := RFO(B) <4) 1; 
SPM1(A) := RF1(B) <b 2; 
SPM2(A) := RF2(B) <+ 3; 


The statements represent the different (i.e., 
MIMD) operations in PU 0, 1, and 2. 


(Example 4) SPM 0,1,2(A) := RF 0,1,2(B) <4 4; 


This statement implies parallel additions in 
the PU 0, 1, and 2 by three nanoinstructions. 


3.2 Decomposition into two-level microprograms 


The language processor has been developed in PL/I consisting of 6200
statements. The major features of the translation are the following.


1) divide a micro-nano combined statement into a microinstruction and
   nanoinstructions, and assign the nanoinstructions to appropriate
   PUs;

2) extend a statement for multiple PUs to several statements and
   assign them to appropriate PUs; and

3) optimize the two-level microprograms.

We illustrate 1) and 2) by using sample translations. The optimization
conditions and their implementation issues are found in [3].


(Example 5) Translation of the Example 1 statement as an example of
feature 1).


To explain the translation process, the mechanism for reflecting the
parallel test results in multiple processors to the micro-level
sequencing is shown in Figure 3. In each PU, a nanoinstruction selects
2 flags from the 32-bit flag register (FLR), does logical operations
(f0) on the 2 flags, and sets the result to the TEST flag for each PU.
Another microinstruction does logical operations (f1) on the 4-bit
TEST flag, and the result is used for a branch condition at the
microprogram level. Thus, the IF statement of Example 1 is decomposed
as follows:


m:  IF TEST2 = 1 THEN GOTO LO;
n2: IF NEG2 = 1 NOR DZ2 = 1 THEN SET TEST2;


Notice that m represents microinstruction and ni represents
nanoinstruction of PUi. We show the contents of each object micro- or
nano-instruction not in a bit pattern format but in the statement
format to aid the readability. Notice also that n2 is executed before
m according to the timing constraints [4]. Thus, the "negative or"
operation (NOR) for the negative flag (NEG) and zero flag (DZ) of PU2
realizes the virtual positive flag (POS).
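The two-stage test mechanism can be sketched in Python as follows. This is an illustrative model only: the flag register is represented as a dictionary of named flags, and the function names and the choice of f1 as a simple selection of one TEST bit are assumptions, not the MUNAP encoding.

```python
from typing import Callable, Dict, List

Flags = Dict[str, bool]


def nor(x: bool, y: bool) -> bool:
    return not (x or y)


def nano_test(flr: Flags, a: str, b: str,
              f0: Callable[[bool, bool], bool]) -> bool:
    """Nano level: combine two flags of one PU with f0 to get its TEST flag."""
    return f0(flr[a], flr[b])


def branch_on_pos(flrs: List[Flags], pu: int) -> bool:
    """Model of 'IF POSpu = 1 THEN GOTO ...' after decomposition."""
    test = [False] * len(flrs)
    # n<pu>: IF NEG = 1 NOR DZ = 1 THEN SET TEST
    test[pu] = nano_test(flrs[pu], "NEG", "DZ", nor)
    # m: f1 applied to the TEST flags; here f1 just selects TEST<pu>
    return test[pu]


if __name__ == "__main__":
    flrs = [{"NEG": False, "DZ": False} for _ in range(4)]
    print(branch_on_pos(flrs, 2))   # data in PU2 is positive
    flrs[2]["DZ"] = True
    print(branch_on_pos(flrs, 2))   # data in PU2 is zero
```

The virtual POS flag exists only in the translator: at Level 0 it is always expressed through NEG and DZ, as the decomposition above shows.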


(Example 6) Translation of the Example 2 statement as an example of
features 1) and 2).


The statement is decomposed as follows.

m:  OPR -> <32-bit shift by the SEN> -> IPR

n2: RF2(6) -> OPR2      n0: IPR0 -> SPM0(5)
n3: RF3(6) -> OPR3      n1: IPR1 -> SPM1(5)


The IPRi, OPRj represent the port registers for input to PUi and
output from PUj, respectively (see Figure 2). The nanoinstructions n2
and n3 read out data from the register file to the OPR of PUs 2 and 3.
The microinstruction m controls transfer from the OPR 2,3 to the IPR
0,1 through the SEN. The n0 and n1 write the data on the IPR 0,1 into
the scratchpad memory.
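The data movement of Example 6 can be traced with a small Python model. It is a sketch under simplifying assumptions: register files and scratchpads are dictionaries, the SEN is reduced to a rotation of the four 16-bit port values by two PU positions (32 bits), and all names are hypothetical.

```python
from typing import Dict, List, Optional

N_PU = 4  # four 16-bit processor units


def run_example6(rf: List[Dict[int, int]], spm: List[Dict[int, int]]) -> None:
    """Trace SPM 0,1(5) := RF 2,3(6) through its micro and nano steps."""
    opr: List[Optional[int]] = [None] * N_PU
    opr[2] = rf[2][6]              # n2: RF2(6) -> OPR2
    opr[3] = rf[3][6]              # n3: RF3(6) -> OPR3
    ipr = opr[2:] + opr[:2]        # m: 32-bit shift by the SEN (2 PU positions)
    spm[0][5] = ipr[0]             # n0: IPR0 -> SPM0(5)
    spm[1][5] = ipr[1]             # n1: IPR1 -> SPM1(5)


if __name__ == "__main__":
    rf = [{6: 10}, {6: 11}, {6: 12}, {6: 13}]
    spm: List[Dict[int, int]] = [{}, {}, {}, {}]
    run_example6(rf, spm)
    print(spm)
```

One source statement thus expands into one microinstruction and four nanoinstructions, two on the sending PUs and two on the receiving PUs.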


Notice that the system allows the user to use not only the simple, one
statement description, such as Example 1, but also the direct
description of Level 0 operations, such as Example 5. These options
are the answer to the contradictory requirements of "ease of use" and
"efficiency" for language design, as described in 2.1.


3.3 Experimental results 


We performed an experiment in order to evaluate the effectiveness of
the Level 1 architecture on such items as (1) the way in which the
two-level microprograms are described by using micro-nano combined
statements, (2) the way in which the parallel operations are described
by using micro-nano combined statements and parallel nanostatements,
and what the degree of parallelism is, and (3) the effectiveness of
the translation process for supporting high-level description. Ten
problems were given to the members of our laboratory who were
knowledgeable about the micro-assembly language and support system on
the console processor. The results are shown in Table 1. Notice that
the capital letter in parentheses in the following description
corresponds to an item in Table 1.

Fig. 3 Parallel test by multiple processors.


(1) Description of two-level microprograms: Microinstructions are
described in microstatements(B), which do not include nano-level
operations, and micro-nano statements(D). The micro-nano statements(D)
account for 33.1 % of the total microinstructions(B+D) used in the
microprograms. This significant usage demonstrates the effectiveness
of the micro-nano combined statements for describing tightly coupled
micro-nano operations.


(2) Description of parallel operations: The description of parallel
processing is 27.0 % of total descriptions. (This percentage came from
a careful reading of all the object microprograms and is not
explicitly shown in Table 1.)


Table 1 Experiment results at micro-assembly language level


Microprogram Number              1    2    3    4    5    6    7    8    9   10

Number of Statements (A)       106  138  179  134  144   61  145  187  159   95

Micro          1μ-0n (B)        35   35   76   38   51   20   35   64   33   23
Nano           0μ-1n            33   79   61   55   32   30   37   78  102   59
               0μ-2n (C)         4   12    0    0    6    3    6    0    0    0
               0μ-3n             0    0    0    0    3    1    5    2    0    0
               0μ-4n            18    0    4    7   30    0   37   21    6    3
Micro-Nano     1μ-1n             9    5   11   19   13    1    6    9    2    2
               1μ-2n (D)         0    4    0   11    0    0    6    1    0    0
               1μ-3n             0    0    0    0    0    2    ?    0    0    0
               1μ-4n             7    3   27    4    9    4   11   12   16    8

Number of Nanoinstructions after Decomposition
               PU0              63    7   93   87   54   35   88   75   53   30
               PU1 (E)          31   77   34   18   58    7   68   60   56   31
               PU2              28   41   34   17   55    7   69   47   43   23
               PU3              28    3   35   20   54   10   51   45   40   21
               Total (F)       150  128  196  142  221   59  276  227  192  105


μ: microinstruction
n: nanoinstruction

According to the classification of 3.1 (2), we can classify them into
three categories: (i) micro-nano combined statements (33.3 %), (ii)
parallel nano statements (41.9 %), and (iii) different nano statements
(24.8 %). The item (i) corresponds to (D) and items (ii) and (iii) are
included in (C). These results show the effectiveness of our approach
for describing parallel processing by the micro-nano combined
statements and the parallel nano statements for the SIMD operations.
The items (i) and (ii) are further divided according to the number of
PUs used. In item (i), the ratio between 1 micro - 2 nano (i.e., 1
microinstruction activates 2 PUs at the same time), 1 micro - 3 nano,
and 1 micro - 4 nano is 7:1:34 (see (D)). In item (ii), the ratio
between 2 nano, 3 nano, and 4 nano is 3:1:10. These results show that
4 PUs are effectively utilized for various problems.


(3) Translation of two-level microprograms: The number of
nanoinstructions(F) after decomposition is 2.3 times that of
explicitly described nanoinstructions(C). This is due to (i) the
implicit description in the micro-nano combined statements, and (ii)
the parallel nano statements that allow the user to describe parallel
operations in one statement. These results show the effect of
high-level description at Level 1. The item (E), the number of
nanoinstructions for each PU after decomposition, shows the locality
of multiple processor usage. In some problems, for example,
microprograms 2, 3, 4, and 6, the locality is evident.


4.1 Design objectives and language features 

The micro-assembly language has provided the relatively high-level
micro-architecture to the user. However, it is still difficult to
describe large utility programs or application programs in such a
language.


The following objectives are defined for 
designing the MUNAP System Description Language 
(MSDL): 


(1) Definition of high-level architecture: On the Level 1
architecture, we defined a rich set of data types, operators, and
functions. Included are, in particular, data types of two dimensional
array and structure, control statements of IF, WHILE, FOR, and SWITCH,
operators of data exchange and concatenate, and functions for bit and
string operations.

(2) Utilization of hardware features: To utilize 
the hardware features of MUNAP, we represent some 
of them in the language. Examples are string 
functions for nonnumeric function units such as 


BOU and DCU, and shift and exchange operators for 
the SEN. Some of the data types, such as flag, 
are also represented. This is a compromise 


between the high-level, problem-oriented architec- 
ture and low level hardware organizations. 


(3) Tagged architecture: The goals of an effective computer
architecture are not only the efficient processing of large amounts of
data but also the enhancement of the debugging process and the
enhancement of reliability of the computing system [5, 9]. Higher
processing capability, obtained by the parallel processing, should be
applied not only for processing large amounts of data at high speed
but also for improving the user interface by semantic checking during
program execution. To implement these concepts at the system
description level, we designed the tagged architecture. This
architecture is expected to provide the facilities for (i) detecting
several kinds of errors at run time, such as referring to an
unassigned data value, and (ii) automatically transforming the data
types of operands. These two items aid the development process of
programs.


4.2 Parallel processing of MUNAP System Descrip- 
tion Language (MSDL) 


Basically, the source program written in MSDL is 
translated into an intermediate form by the host 
processor ECLIPSE, and then interpretively exe- 
cuted by MUNAP. In order to make use of the MUNAP 
micro-architecture features, the intermediate 
language has a one-to-one correspondence with the
MSDL source statements. The translator and the
interpreter are described in the ECLIPSE assembly
language and the MUNAP micro-assembly language
(i.e., the Level 1 language), respectively. The
interpreter consists of 3.2 K microinstructions 
and 7.6 K (1.9 K for each PU) nanoinstructions. 


We will concentrate our discussion on the
processing features supported by the parallelism
and nonnumeric functions of MUNAP. For the details
of the intermediate language formats and their
processing, see [10].


arith- 
metic and logical operations are executed in 
parallel in the four PUs for each operator of 
MSDL. The type check for operands and, if neces- 
sary, the translation to an appropriate type are 
also done dynamically. The information about the 
result is stored in the tag field, as described
later.


As an example of parallel processing, we show the 
outline of the shift operator microroutine that 
does an N-bit circular shift on the 64-bit data.
As 4 bits are the smallest unit of the SEN operation,
the SEN shifts the data 4 x D4 (D4 = N div 4) bits
in one machine cycle. Then, the ALU shifts it
M4 (M4 = N mod 4) bits, one bit at a time. The use
of the SEN reduces the number of machine cycles
from 31.5 to 2.5 on the average.
This example shows not only the enhancement of 
processing by parallelism but also the provision 
of a uniform function (in this case, shift) to the 


user. If we use the function at the micro-assembly
language level, we must directly control
the SEN and the ALU shifter to get appropriate
results.
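The two-stage shift described above can be sketched in Python. This is only an illustration under stated assumptions: the shift direction (left), the function names, and the cost model (one cycle for the coarse SEN-style step, one cycle per single-bit ALU-style step, against one cycle per bit without the coarse step) are not taken from the MUNAP microcode. Under that cost model, averaging over N = 0..63 reproduces the 31.5 and 2.5 cycle figures quoted in the text.

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1


def rotl(x, n):
    """Circular left shift of a 64-bit word by n bits."""
    n %= WIDTH
    return ((x << n) | (x >> (WIDTH - n))) & MASK


def two_stage_shift(x, n):
    """Rotate x by n bits as one coarse step plus single-bit steps.

    Returns the rotated word and the number of modelled machine cycles.
    """
    d4, m4 = divmod(n, 4)
    x = rotl(x, 4 * d4)      # coarse step: a multiple of 4 bits in one cycle
    cycles = 1
    for _ in range(m4):      # fine step: one bit per cycle
        x = rotl(x, 1)
        cycles += 1
    return x, cycles


def average_cycles():
    """Average cycles over N = 0..63: bit-by-bit versus two-stage."""
    bitwise = sum(range(WIDTH)) / WIDTH
    staged = sum(two_stage_shift(1, n)[1] for n in range(WIDTH)) / WIDTH
    return bitwise, staged
```

Calling `average_cycles()` gives the pair of averages for the two strategies under the assumed cost model.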


string functions are 
count and priority encode functions of the bit 
operation unit (BOU), field extraction and embed- 


ding functions of the divide and concatenate unit 
(DCU) in the four processor units, and the shift 
and broadcast functions of the SEN. For example, 
a bit string extraction function BSUBST(BIT, POS,
N) extracts N bits of data starting at bit position
POS of the 64-bit variable BIT. To do this operation,
the routine first gets the data from 8
banks of MM into the register file (RF) in 4 PUs in
parallel. Then, the data is shifted (POS - 1)
bits by using the SEN and the ALU shifter. After
computing i (= N div 16) and j (= N mod 16), the
data is concatenated with 0 at the (j+1)-bit at
PUi, and is stored into 8 banks of MM in parallel.
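The BSUBST steps above can be illustrated with a short Python sketch. Several details are assumptions, not statements from the paper: bit position 1 is taken to be the most significant bit, the shift is taken to be a circular left shift, PU0 is taken to hold the most significant 16 bits, and the units after PUi are cleared so that the result is left-justified.

```python
MASK64 = (1 << 64) - 1


def rotl64(word, count):
    """Circular left shift of a 64-bit word."""
    count %= 64
    return ((word << count) | (word >> (64 - count))) & MASK64


def bsubst(bit, pos, n):
    """Extract n bits starting at position pos (1 = leftmost), left-justified."""
    shifted = rotl64(bit, pos - 1)                 # bring bit POS to the front
    units = [(shifted >> (48 - 16 * k)) & 0xFFFF for k in range(4)]  # 4 PUs
    i, j = divmod(n, 16)
    keep_j = (0xFFFF << (16 - j)) & 0xFFFF         # top j bits of one unit
    result = 0
    for k, unit in enumerate(units):
        if k == i:
            unit &= keep_j                         # zeros from bit j+1 onward
        elif k > i:
            unit = 0                               # assumed: later units cleared
        result = (result << 16) | unit
    return result
```

For example, `bsubst(0x0123456789ABCDEF, 5, 8)` keeps the eight bits that start at the fifth bit and clears everything after them.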


This example shows the difficulty and tediousness 
of handling the multiple processing units, espe- 
cially if they have some specific features, such 
as nonnumeric processing functions. These functions
provide the user with a high-level but efficient
interface by performing, on the user's behalf, the
tedious and sometimes tricky operations needed to
utilize the parallelism of the micro-architectures.


(3) Implementation of tagged architecture: Each
variable in MSDL has a 26-bit tag field, as shown in
Figure 4. The check points are divided into two
major parts: (a) checks at the fetch and operand 
access phase, and (b) checks in the execution 
phase. The checks for item (a) include: (i) sys- 
tem variable error (check the range of system 
variables such as the intermediate language instruction
counter), (ii) parameter error (check the
number, attribute, and order of formal parameters 
for procedure call instruction), and (iii) access 
error (check if the MM and SPM addresses point to 
the user variable area, the operand value is 
defined, and the index of array is within the 
correct range). The checks for item (b) depend on 
the content of processing. In the arithmetic 
operations, for example, overflow, underflow, and 
division by 0 are checked. In both (a) and (b), 
the BOU and DCU functions ease the reference to 
the tag, which is divided into several fields. 
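As an illustration of checks (a) and (b), the following Python sketch attaches a small tag to each value and performs an unassigned-value check, a type transformation, and overflow and division-by-0 checks. The tag layout, the class and function names, and the 16-bit integer range are assumptions made for the example; they do not reproduce the 26-bit MSDL tag format.

```python
from dataclasses import dataclass


class TagError(Exception):
    """Raised when a run-time tag check fails."""


@dataclass
class Tagged:
    type: str              # "int" or "real" (hypothetical attribute encoding)
    value: object = None
    defined: bool = False  # set once a value has been assigned


INT_MIN, INT_MAX = -(1 << 15), (1 << 15) - 1   # assumed 16-bit integer range


def fetch(operand):
    """Operand access check: the value must have been assigned."""
    if not operand.defined:
        raise TagError("reference to an unassigned value")
    return operand


def unify(a, b):
    """Type transformation: promote to real when the operand types differ."""
    if a.type == b.type:
        return a.type, a.value, b.value
    return "real", float(a.value), float(b.value)


def checked(op, a, b):
    """Execute '+' or '/' with execution-phase checks; result carries a tag."""
    kind, x, y = unify(fetch(a), fetch(b))
    if op == "+":
        result = x + y
    elif op == "/":
        if y == 0:
            raise TagError("division by 0")
        result = x / y if kind == "real" else x // y   # floor division for ints
    else:
        raise ValueError("unsupported operator: " + op)
    if kind == "int" and not INT_MIN <= result <= INT_MAX:
        raise TagError("overflow")
    return Tagged(kind, result, True)
```

Adding an integer to a real, for instance, yields a value tagged "real", while adding to a `Tagged("int")` that was never assigned raises `TagError`.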


These checks may seem to be redundant. However,
they not only enhance the user interface but also
catch erroneous actions caused by incorrect input
data. Further, the tagged architecture is made
feasible by the fast parallel processing of multiple
processors.


4.3 Experimental results 


To evaluate the effectiveness of the system
description language level architecture, we performed
an experiment. Table 2 shows the static
information about the MSDL interpreter. That is,
for each function, the number of microstatements
(M), the number of nanostatements (N), the number
of micro-nano combined statements (MN) in micro-assembly
language, and the average PU numbers
activated in each machine cycle are given.

[Figure 4: tag layout with the fields Attribute, Capacity,
Overflow/Underflow, Refer, and Number of references, each
annotated with its bit length.]

Fig. 4 Tagged data structure.

Table 2 Static data from interpreter

Function of        Micro    Nano     Micro-      Active
Module             (M)      (N)      Nano (MN)   PUs

Control
  Main              5        2        3          1.80
  Initialize        1       20        7          3.77
  Fetch Instr.      2       13       21          2.66
  New Code         10       19       41          1.79
  Operand Access   15.25    21       22.29       1.64
  Decode           24.75     1.50     2.50       1.13
Operators
  Arithmetic        6.07     8.93    15.66       2.09
  Logical           2        5        4          2.56
  Shift            13       20.16    45.23       1.92
  Exchange         10       11       15          2.23
  Expression       a2        6.33    18.67       1.95
Statements
  Procedure        16       14.50    25.50       1.69
  IF                2        1.30     3.33       1.50
  GOTO              6       16        6          1.58
  FOR               6.25    12       10.60       2.19
  WHILE             4        7.67    10.33       1.86
String Func.
  Bit              12       10       43.75       1.90
  Character         9.75    10       37.25       2.21
Subroutines         5.54     4.67     9.29       2.12
Average             9.05     8.71    16.66       2.00

Table 3 Processing time ratio for tag processing

           Type     Type        Tag      Result
           Check    Transform   Create   Set

Integer    28.24    46.56       14.50    10.69
Real       31.71    40.65       16.26    11.38

The following features are observed.

(1) The numbers M, N, and MN represent the
numbers of statements required for realizing 
the system description language level (Level 
2) by using the assembly level language 
(Level 1). 

(2) From 1.13 to 3.77 PUs are used for each 
machine cycle in microprogram modules. The 
average number of active PUs for all the 
micro-routines in the interpreter is 2.00. 


Further, as a result of optimization, we have
improved it to 2.19. The numbers for the
control part vary greatly. However, in the
routines for operators and functions, the
number of active PUs is around 2.


Further, the results for the dynamic data are summarized
as follows (detailed data is found in [10]):
(3) The ratio of micro (M), nano (N), and micro-nano
    combined (MN) statements is 19:23:58. This shows
    the effective use of Level 1 micro-nano combined
    statements at Level 2.


(4) The average number of dynamically active PUs
    is 2.14.

(5) The overhead caused by error checks and
    related operations is classified into 4
    categories as shown in Table 3. Type check
    and type transformation are the major parts.
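The observation in item (5) can be checked with a few lines of arithmetic on the Table 3 values. The dictionary keys and the function name below are chosen for this example only.

```python
# Processing time ratios from Table 3, in percent of tag-processing time.
TABLE_3 = {
    "Integer": {"type_check": 28.24, "type_transform": 46.56,
                "tag_create": 14.50, "result_set": 10.69},
    "Real": {"type_check": 31.71, "type_transform": 40.65,
             "tag_create": 16.26, "result_set": 11.38},
}


def type_handling_share(row):
    """Combined share of type check and type transformation, in percent."""
    return round(row["type_check"] + row["type_transform"], 2)


shares = {name: type_handling_share(row) for name, row in TABLE_3.items()}
```

With these values, type check and type transformation together account for roughly three quarters of the tag-processing time for both operand types.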


Items (2) and (4) provide a guideline of how many 
processors are active on average when we apply 
multiple processors to ordinary problems (not spe- 
cial parallel problems, such as array processing). 


5. Effects of Architecture Hierarchies 

The effects of architecture hierarchies are 
illustrated in Figure 5, where a sample bit count 
function is described at three levels. At the 
lowest level, we must describe the microprograms 
in bit pattern format for multiple processors 
(Figure 5(c)). At the highest, system description
language level, it may be written as a single
function call (Figure 5(a)). At the middle level,
the microprogram is written in register transfer 
language (Figure 5(b)). 


To make the difference clear, we summarize it
in Table 4. The following items are observed from
the table.
(1) The higher the level of the language, the
    richer its facilities. This can be generalized
    to all aspects of the language, such as
    the data structure, arithmetic and logical
    operators, and control functions.
(2) At the lowest level of Level 0, the user must 
take care of the parallelism and two-level 
control scheme of the bare machine. At Level 
1, the frequently used micro-nano combina- 
tions, and the instructions with the SIMD 
feature, may be described in one statement. 
But these features do not completely hide the 
hardware features, such as parallelism among 
multiple processors and two-levels of con- 
trol, from the user. At Level 2, such 
    hardware features are almost hidden from the
    user, and problem-oriented functions are
    provided.
(3) The utilization of multiple processor parallelism
    does not change between Levels 0 and
1, because they have the same description 
capability. However, it slightly decreases 
from Level 1 to Level 2, in exchange for 
independence of parallelism recognition by 
the user. 
(4) The error check function is only provided at 
Level 2 to aid the programming process and 
enhance the system reliability. 
(5) The extensibility of the language differs
    between Levels 1 and 2. The extensibility at
    Level 1 corresponds to the extensibility of
    hardware such as the addition of a new
    microinstruction field or micro-order. The
    Level 2 extension is the addition of new
    functional modules.

BCT(BIT, 1)

(a) Level 2

MICRO MAIN BITCOUNT (100);
EXT NEXT (2);
x;
RF(2) := 0;
CX0 := 0;
L1: AM MODE M8 (X,H) PU(3-0);
OPR0 := RF0(1) <+> CX0;
SPM(10) := MM;
RF(3) := <BCT,1> SPM(10);
RF(2) := RF(2) <+> RF(3) CX0+1;
*IF CX0 MOD 4 <> 0 THEN GOTO L1;
IPR := <SRL16> OPR;
OPR0 := RF0(2);
OPR1 := RF1(2) <+> IPR1;
OPR2 := RF2(2) <+> IPR2;
RF3(4) := RF3(2) <+> IPR3;
GOTO NEXT;
END;

(b) Level 1
* see [4] for comments

[Bit-pattern listing: 28-bit microinstructions by micro
address and 40-bit nanoinstructions by nano address for PU#0.]

(c) Level 0

Fig. 5 Sample bit count programs at Levels 2, 1, and 0.


The above comparison shows that we have defined a
reasonable interface as a compromise between the
user's requirements and the multiprocessor system
throughput.

6. Concluding Remarks 

We have developed hierarchical micro-architectures
for a two-level microprogrammed multiprocessor
computer. The results from the
development of language processors and some experiments
show the effectiveness of such architectures.
These results will be especially useful
for defining an "easy to use" but "efficient to
implement" architecture on a machine with innovative
hardware organization. As the hardware cost
decreases, it becomes feasible to construct such
machines in many application areas. Thus, the need
for defining a good architecture on the machine
will increase.


The experimental results also show the important
guideline that about 2 of the 4 processor
units are activated on average, even if the
system is applied to ordinary (not parallel) problems.
This means that in most machine cycles multiple
(i.e., from 2 to 4) processors are activated
in parallel. The efforts of experienced microprogrammers
to exploit the inherent parallelism within
the problems yield such results.

Table 4 Comparison among three microarchitecture levels

Language Features

  Data Structure
    Level 0: Integer (16), Character, Boolean
    Level 1: Integer (16), Character, Boolean
    Level 2: Integer (16,32,64), Real (32,64),
             Character, Boolean

  Arithmetic Ops.* and Test
    Level 0: Functions of bare hardware
    Level 1: Level 0 plus combined ops. for transfer
             and test
    Level 2: Problem-oriented operators and functions

  Control
    Level 0: Branch, Branch on condition
    Level 1: IF, GOTO, CASE that correspond to the
             L0 functions
    Level 2: IF, WHILE, FOR, SWITCH in problem-oriented
             format

  Extensibility
    Level 0: Equals that of hardware
    Level 1: New microinstr., Microinstr. field,
             Micro-order
    Level 2: New microprogram module

Architectural Features (only the Level 0 entries are recoverable)

  Parallelism: Direct descrip. by users
  Two-Level Microprograms: Direct descrip. by users
  Uniformity: Limited by avoidance of hardware redundancy
  Facilities for Program Test: Debugger to run and
    monitor microprograms

*ops.:


Our future problem is to develop application
programs, such as a database system, in the
system description language and verify the effectiveness
of the architecture from the viewpoints
of 1) parallelism utilization of multiple processors
and nonnumeric units distributed under two
levels of control and 2) effectiveness of the
tagged architecture for software development.


ALTERNATIVE DATA STRUCTURES FOR LISTS
IN ASSOCIATIVE DEVICES

Abstract -- If the full power of parallel
associative devices is to be utilized,
alternative implementations of standard logical
data structures must be developed. This paper
presents three methods for implementing lists
suitable for associative memories and processors.
The associative tree technique is based on
the logical tree structure implementation of a
list. Its main advantage is that storage
allocation and release are straightforward and
garbage collection is avoided. The CDAR encoding
technique stores the tokens of the list with a
code that encodes their position. This technique
supports sublist matching since it allows any CDAR
code range to be searched for in constant time.
The EPS method is similar to the CDAR code method.
The main difference is that the position
information is more efficiently encoded, allowing
more compact storage of long lists but requiring
slightly more complex searching. All three of
these techniques appear well suited for
implementing list based languages such as LISP and
PROLOG on associative processors, with the
potential of substantially improving program
execution speeds.

Introduction 
The Symbolic or S-expression is a
parenthesised list of atomic symbols which forms
the basic data structure for many AI oriented
languages such as PROLOG and LISP. In
conventional computers, the lists are logically
implemented as tree structures which in turn are
normally physically implemented with linked lists.
The use of lists in AI programming research is
popular because it supports symbolic manipulation
which allows complex algorithms to be programmed
relatively easily. Unfortunately, programs
written using lists for data and program storage
tend to result in slow program execution when
processing data bases of non-trivial sizes. If
LISP, PROLOG and other high level languages using
lists are to be used in practical applications
with large data bases, new, faster alternative data
structures must be developed.

Several approaches are possible to address
the problem of slow execution of list based
languages such as LISP and PROLOG. One approach
is to optimize the code of a conventional computer
for list functions [2],[5]. Another is to
optimize the linked list structure [3]. Both of
these approaches speed up execution but still
suffer from the inherent drawback of linked list
memory allocation: garbage collection. The time
for garbage collection can be dramatically reduced
by using 5th generation type devices such as
associative processors and memories to implement
conventional linked list storage [4]. However,
this approach does not fully take advantage of the
parallel search capability of these devices.

0190-3918/83/0000/0486$01.00 © 1983 IEEE

This paper discusses three alternative
approaches to conventional linked list structures
suitable for list implementation in parallel
associative memories and computers.


Traditional Storage 


The binary tree representation of the list (A
(B C) D) is shown in Figure 2-1. Figure 2-2 shows
a typical linked list implementation. Many texts
have been written explaining LISP and its S-expression
implementations [7]. This section will
simply review concepts important to the following
discussion. Readers acquainted with LISP and list
processing concepts may wish to skip this section.


Figure 2-1 Tree Representation of (A (B C) D)


[Diagram: linked cells for the atoms A, B, C, and D, with
nil marking the end of each list.]

Figure 2-2 Link List Implementation of (A (B C) D)


In LISP the two basic functions CAR and CDR
allow the programmer to traverse binary trees or,
equivalently, to generate sublists from lists.
Basically, a CAR is an instruction to traverse to the
left subtree from a node, while a CDR causes a
traversal to the right subtree. In list notation,
the CAR is the element obtained by extracting the
leftmost element of the list. The CDR is the
sublist composed of the remainder of the list after
the first element has been extracted. Thus the CAR
of the list (A (B C) D) is A, while the CDR is
((B C) D). CAR and CDR functions are frequently
chained together. The chaining is abbreviated by
writing simply A or D in sequence to represent CAR
or CDR and then enclosing the sequence in a single
C and R, as in CADDAR. The order of application is
from right to left (innermost function call to
outermost).


CONS is another list manipulation function
(construction) and is used to build lists. It
inserts a new element at the beginning of a
list. Thus the CONS of A and ((B C) D) is (A
(B C) D).

Associative Program Design Language (APDL) 


In the following sections, APDL will be used 
for algorithm descriptions implemented in 
associative memories. APDL reflects the fact that 
loop information is vital in a sequential computer 
program, but in an associative memory, it is 
redundant and consequently not needed. Thus the 
statement: 


FIELDj(*) := FIELDj(*) + CONSTANT


is used in place of a loop. The * indicates that
the statement is executed in parallel on every
enabled word in the associative memory [6].

The responses to a parallel search can be
processed sequentially by using the NEXT (get next
responder index) and EOR (end of responders)
functions and a ~ index variable notation. If
there are no responders, i.e. all elements of the
variable are false, EOR(RESPOND(*)) will be true
and RESPOND(*)~ will be undefined. If
EOR(RESPOND(*)) is false, then NEXT(RESPOND(*))
assigns the internal memory index of the next
true word recorded by RESPOND(*) to RESPOND(*)~
and sets the value for that word to false.
RESPOND(*)~ can be used as an 'index' variable in
place of * in any associative field reference.


In general, any field can be restricted to
the responders of a parallel search within the
scope of an IF statement. For example,


IF FIELDi(*) = 'NAME' THEN FIELDj(*) := 'value'


modifies only those elements of FIELDj whose
corresponding elements of FIELDi equal NAME.
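A sequential Python model may help in reading the APDL statements above: a loop over all words plays the role of the * index, and a list of booleans plays the role of the responder variable. The class and function names are hypothetical, and the model says nothing about how an actual associative device performs the search.

```python
class AssociativeMemory:
    """Sequential stand-in for an associative memory of named fields."""

    def __init__(self, words):
        self.words = [dict(word) for word in words]

    def search(self, field, value):
        """FIELD(*) = value: one response bit per word."""
        return [word.get(field) == value for word in self.words]

    def assign(self, field, value, responders=None):
        """FIELD(*) := value, optionally restricted to the responders."""
        for index, word in enumerate(self.words):
            if responders is None or responders[index]:
                word[field] = value


def eor(responders):
    """EOR: true when no responder is left."""
    return not any(responders)


def next_responder(responders):
    """NEXT: return the index of the next true word and reset it to false."""
    index = responders.index(True)
    responders[index] = False
    return index
```

The IF statement of the text then corresponds to `mem.assign("FIELDj", "value", mem.search("FIELDi", "NAME"))`, and a `while not eor(r): i = next_responder(r)` loop walks the responders one at a time.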


In an associative memory, storage allocation 
is easily handled by defining one of the fields to 
be a STATUS field. Thus if a new word is needed, 
its index is obtained by the code: 


AVAIL(*) := STATUS(*) = 'IDLE'
NEXT(AVAIL(*))
INDEX := AVAIL(*)~
STATUS(INDEX) := 'BUSY'


Conversely, when a word is no longer needed, it is 
returned to the available storage category simply 
by writing IDLE in its STATUS field. 
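The allocation and release steps can be written out as a small Python sketch in which the STATUS field is a plain list. The function names and the error raised when no word is idle are assumptions for the example.

```python
IDLE, BUSY = "IDLE", "BUSY"


def allocate(status):
    """Claim any idle word and return its index."""
    avail = [state == IDLE for state in status]   # AVAIL(*) := STATUS(*) = 'IDLE'
    if not any(avail):                            # EOR(AVAIL(*)) is true
        raise MemoryError("no idle word available")
    index = avail.index(True)                     # NEXT(AVAIL(*)); INDEX := AVAIL(*)~
    status[index] = BUSY                          # STATUS(INDEX) := 'BUSY'
    return index


def release(status, index):
    """Return a word to the available category."""
    status[index] = IDLE
```

No free list or compaction is involved: a released word simply becomes a candidate for the next search on the STATUS field.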

In the following sections, associative code 
will be used for algorithm descriptions 
implemented in associative memories. All 
associative statements are easily understood if it 
is kept in mind that a * index is equivalent in a 
sequential computer to a loop through all words in 
the memory. 


Associative Structures 
Associative Trees 


Figure 4-1 illustrates a straightforward
implementation of the traditional tree
representation of a list in an associative memory.
The primary difference in this implementation is
that pointers are replaced by a node name or ID.
Each link record contains its own ID and as such
is 'relocatable.' That is, any node record can be
stored in any memory location (word). In Figure
4-1, the data is sorted on the ID field for ease
of presentation and understanding. In an actual
associative memory, the 9 records shown could be
in any order in any of the memory's words. The
actual configuration would be a function of the
values in the Status Field at the time the entries
were made.


In this associative memory implementation of
a tree data structure, a list or sublist is
designated by the ID of the root node of the
logical tree structure. Thus the list (A (B C) D)
is designated by ID #1.


The implementation shown stores a pointer to
the atom in the left cell instead of the atom
itself; thus the CAR function is simply
implemented by chaining on the left child node ID.
The CAR of ID #1 is ID #6, which is the storage
node for atom 'A'. The CDR of ID #1 is ID #2,
which is the node ID that designates the list ((B
C) D). List construction (CONS) is accomplished
by generating a new entry in any available word of
memory (i.e. idle status) with the appropriate
node IDs. Thus CONS of A, node ID #6, and ((B C)
D), node ID #2, would generate the following and
return node ID #m.


|   B   |   m   |  N  |    6    |    2    |


           STATUS   ID      ENTRY   NAME FIELD (atom entries)
           FIELD    FIELD   TYPE    LEFT     RIGHT
                            FIELD   CHILD    CHILD  (node entries)

WORD n+0     B        1       N       6        2
WORD n+1     B        2       N       3        4
WORD n+2     B        3       N       7        5
WORD n+3     B        4       N       8        nil
WORD n+4     B        5       N       9        nil
WORD n+5     B        6       A       A
WORD n+6     B        7       A       B
WORD n+7     B        8       A       D
WORD n+8     B        9       A       C

Entry type: A = ATOM, N = NODE.


Figure 4-1 Associative Linked List


Note that with this type of memory
organization, the CAR, CDR and CONS functions are
as easy to implement as with conventional linked
lists. The primary difference is that this
organization is best suited for associative
memories since the data is accessed via a key
(node ID in this case) and not by a memory
address. Since associative memories can access
the entire memory in the same amount of time as
one word, the data does not have to be organized
(i.e. sorted) to achieve retrieval efficiency.
When records (i.e. node entries) are no longer
needed, the status field of the word is simply
marked idle and is thus available for reuse by the
next memory access seeking an idle word. Complex
garbage collection is not needed.
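The associative tree of Figure 4-1 can be modelled in Python as a list of records that are found by their ID field rather than by address. The sketch below is an assumption-laden illustration: a sequential scan stands in for the parallel compare, the field names are invented for the example, and new node IDs are simply numbered past the largest ID in use (the paper only says that some ID #m is returned).

```python
def make_memory(size):
    """A memory of idle words."""
    return [{"status": "IDLE"} for _ in range(size)]


def store(memory, **fields):
    """Write a record into any idle word and mark the word busy."""
    word = next(w for w in memory if w["status"] == "IDLE")
    word.clear()
    word.update(fields, status="BUSY")


def lookup(memory, node_id):
    """Content search on the ID field."""
    return next(w for w in memory
                if w["status"] == "BUSY" and w.get("id") == node_id)


def car(memory, node_id):
    return lookup(memory, node_id)["left"]


def cdr(memory, node_id):
    return lookup(memory, node_id)["right"]


def cons(memory, left_id, right_id):
    """Create a node entry and return its ID (numbering scheme assumed)."""
    busy_ids = (w["id"] for w in memory if w["status"] == "BUSY")
    new_id = 1 + max(busy_ids, default=0)
    store(memory, id=new_id, type="N", left=left_id, right=right_id)
    return new_id


def release(memory, node_id):
    """Mark the word idle; no other bookkeeping is needed."""
    lookup(memory, node_id)["status"] = "IDLE"


def to_list(memory, node_id):
    """Rebuild the nested list designated by a node ID (for inspection)."""
    if lookup(memory, node_id)["type"] == "A":
        return lookup(memory, node_id)["name"]
    items = []
    while node_id is not None:            # None plays the role of nil
        items.append(to_list(memory, car(memory, node_id)))
        node_id = cdr(memory, node_id)
    return items


def load_figure_4_1(memory):
    """Enter the nine records of Figure 4-1."""
    for node_id, left, right in [(1, 6, 2), (2, 3, 4), (3, 7, 5),
                                 (4, 8, None), (5, 9, None)]:
        store(memory, id=node_id, type="N", left=left, right=right)
    for node_id, name in [(6, "A"), (7, "B"), (8, "D"), (9, "C")]:
        store(memory, id=node_id, type="A", name=name)
```

After `load_figure_4_1(memory)`, `to_list(memory, 1)` rebuilds (A (B C) D), and `cons(memory, 6, 2)` adds one more node record designating the same list contents.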


CDAR Codes 


The associative tree organization uses
associative memories but still requires that
chains of CAR and CDR functions (henceforth
abbreviated CDAR functions) be executed
sequentially. Another data representation, shown
in Figure 5-1, allows any sublist which can be
defined by a CDAR function to be searched for
directly and in parallel. Thus with this
representation, all CDAR functions can be executed
in a constant amount of time.


This storage technique uses a CDAR code,
illustrated in Figure 5-1, designed so that numeric
range searches can be used to search for sublists.
Thus if the list ((A B (C D) (((E) F))) G) is to
be processed by the function CDDAR, the function
string is first converted into the CDAR code 011.
Then, the lower bound of the search is obtained by
adding zero fill, the upper bound by adding one
fill. Thus in this example, the CDDAR of the list
shown in Figure 5-1 is obtained by selecting all
elements greater than or equal to 0110000000000000
and less than or equal to 0111111111111111. These
elements, C, D, E, and F, form the sublist ((C D)
(((E) F))).


Let 0=CAR, 1=CDR, left justify with 1 fill
(order of application is left to right)
then LIST = ((A B (C D) (((E) F))) G) is


LIST   CDAR
NAME   CODE               ATOM

LIST   0011111111111111   A
LIST   0101111111111111   B
LIST   0110011111111111   C
LIST   0110101111111111   D
LIST   0111000011111111   E
LIST   0111001011111111   F
LIST   1011111111111111   G


Figure 5-1 CDAR Encoding
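The range search can be tried out on the codes of Figure 5-1 with a few lines of Python. Codes are held as 16-character bit strings, so string comparison coincides with numeric comparison; the function names are chosen for this sketch, and a list comprehension stands in for the parallel range search.

```python
WIDTH = 16

# (CDAR code, atom) pairs copied from Figure 5-1.
FIGURE_5_1 = [
    ("0011111111111111", "A"),
    ("0101111111111111", "B"),
    ("0110011111111111", "C"),
    ("0110101111111111", "D"),
    ("0111000011111111", "E"),
    ("0111001011111111", "F"),
    ("1011111111111111", "G"),
]


def function_code(name):
    """'CDDAR' -> '011': drop C and R, put the first-applied letter first."""
    return "".join("0" if ch == "A" else "1" for ch in reversed(name[1:-1]))


def bounds(code):
    """Lower bound with zero fill, upper bound with one fill."""
    return code.ljust(WIDTH, "0"), code.ljust(WIDTH, "1")


def select(records, name):
    """Atoms whose codes fall within the range for the CDAR function."""
    low, high = bounds(function_code(name))
    return [atom for code, atom in records if low <= code <= high]
```

`select(FIGURE_5_1, "CDDAR")` returns the atoms C, D, E, and F, in agreement with the example in the text.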


Figure 5-2 gives the algorithm for generating 
the CDAR code for a list from list input. For 
simplicity, it is assumed that the input string 
has been scanned and the items have been broken 
out and stored in the associative memory field 
ITEM in a manner such that index I will reference 
them in the proper order. 


Basically, the algorithm contains a scan and
a generate procedure. The scan procedure
identifies the next item in the list. If the
item is a left parenthesis, the level of the tree
is incremented by one and the number of nodes on
that level is initialized to zero. If the item is
an atom, the appropriate CDAR code is generated
and associated with the atom. After the code and
atom have been associated, the count of nodes for
the current level is incremented. If the item is
a right parenthesis, the appropriate CDAR code is
generated and associated with a 'nil' symbol (this
marks the end of a substring). The level of the
tree is decremented and the number of nodes on the
lower level is incremented.

The CDAR code generation function simply
generates a string of code for each level. The
code consists of one 1 for each node on a level,
terminated on the right with a zero. The codes
from the levels are concatenated from right to
left (highest level to lowest) with one fill on
the right.


PROCEDURE SCAN

TYPE
  CDARRECORD = RECORD
    CDARCODE: CODETYPE
    ATOM: ATOMTYPE
    LISTNAME: NAMETYPE
  END
VAR
  CDARMEMORY: ARRAY[1..m] OF CDARRECORD
  NODECT: ARRAY[1..n] OF INTEGER

  FUNCTION GENERATE
  CONST
    ACODE = 0
    DCODE = 1
  BEGIN (* GENERATE *)
    GENERATE := -1 (* SET TO ALL ONES *)
    FOR K := LEVEL DOWNTO 1 DO
    BEGIN
      GENERATE := RIGHTSHIFT(ACODE, GENERATE)
      (* RIGHTSHIFT SHIFTS THE SECOND ARGUMENT RIGHT
         ONE BIT AND SHIFTS THE FIRST ARGUMENT INTO
         THE LEFT MOST BIT *)
      FOR L := NODECT(K) DOWNTO 1 DO
        GENERATE := RIGHTSHIFT(DCODE, GENERATE)
    END
  END (* GENERATE *)

BEGIN (* SCAN *)
  LEVEL := 0
  J := 1
  FOR I := 1 TO ENDI DO
    CASE ITEM(I) OF

      LEFTP:
        BEGIN
          LEVEL := LEVEL + 1
          NODECT(LEVEL) := 0
        END

      ATOM:
        BEGIN
          CDARCODE(J) := GENERATE
          ATOM(J) := ITEM(I)
          LISTNAME(J) := NAME
          J := J + 1
          NODECT(LEVEL) := NODECT(LEVEL) + 1
        END

      RIGHTP:
        BEGIN
          CDARCODE(J) := GENERATE
          ATOM(J) := NIL
          J := J + 1
          LEVEL := LEVEL - 1
          NODECT(LEVEL) := NODECT(LEVEL) + 1
        END

    END

END.


Figure 5-2 CDAR Code Generation Algorithm


The CAR and CDR functions are, of course,
special cases of the more general CDAR function.
The CONS function is equally easy to implement
with the CDAR code approach. If the list L1 =
(C (D) E), stored in memory as shown in Figure 5-3a,
is to be CONSed to the list (A B), Figure 5-3b,
the process is simply one of prepending a zero
to the front (left) of the CDAR codes for L2 = (A
B), prepending a 1 to the front of the codes for
L1, and changing the list name of L1 to L2 (see
Figure 5-3c). If a new list is being generated,
the elements of L1 and L2 would be copied before
modification and both list names would be changed.
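The CONS step on CDAR codes (prefix the codes of the first argument with 0 and those of the second with 1) can be sketched in Python using the 5-bit codes of Figure 5-3. The sketch assumes fixed-width codes whose last bit is still a fill bit, so that it can be dropped when the new bit is shifted in; it returns new records rather than renaming entries in place.

```python
def prepend(bit, code):
    """Shift one bit in at the left, dropping one fill bit on the right."""
    return bit + code[:-1]


def cons(first, second, name):
    """CONS two lists given as (code, atom) records with 1-filled codes."""
    return ([(name, prepend("0", code), atom) for code, atom in first]
            + [(name, prepend("1", code), atom) for code, atom in second])


L2_AB = [("01111", "A"), ("10111", "B")]                    # (A B)
L1_CDE = [("01111", "C"), ("10011", "D"), ("11011", "E")]   # (C (D) E)
```

`cons(L2_AB, L1_CDE, "L2")` produces the five codes listed for ((A B) C (D) E) in Figure 5-3c.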


NAME  CODE   ATOM     NAME  CODE   ATOM     NAME  CODE   ATOM

L1    01111  C        L2    01111  A        L2    00111  A
L1    10011  D        L2    10111  B        L2    01011  B
L1    11011  E                              L2    10111  C
                                            L2    11001  D
                                            L2    11101  E

a - List              b - List              c - List
(C (D) E)             (A B)                 ((A B) C (D) E)


Figure 5-3 List Concatenation


Explicit Parenthesis Storage


The Explicit Parenthesis Storage (EPS)
technique associates the list structure with the
atoms by explicitly saving the left and right
parentheses. Figure 6-1 shows a typical EPS
associative record. The record contains the name
of the list, the name of the atom, the number of
left parentheses in the list preceding the atom,
the number of right parentheses preceding and
immediately following the atom, and the position of
the atom in this list.


                 NUM      NUM
LIST             LEFT     RIGHT    POSITION
NAME    ATOM     PAREN    PAREN    NUMBER
FIELD   FIELD    FIELD    FIELD    FIELD


Figure 6-1 An EPS Record Format


The algorithm for generating an EPS
representation of a list is shown in Figure 6-2.
The algorithm is quite similar to the CDAR code
generation algorithm (Figure 5-2) except that
the left and right parenthesis counts (NLP and NRP
respectively) are saved as part of the data
directly. The algorithm is straightforward and
simply counts left and right parentheses as they
are encountered. However, since the right
parentheses count of an atom in position n
includes the right parentheses up to the atom in
position n+1, the calculation of the NRP values
lags behind by one iteration. Thus the NRP value for
the last atom is stored after the entire list has
been processed. Figure 6-3 shows a list and its
EPS representation.

In this representation of a list, the list or
sublist of interest is delineated by specifying
the lowest and highest position number in the
list. Figures 6-4 and 6-5 give the algorithms for
the CAR and CDR functions respectively. These
algorithms assume that the global variables LOWEST
and HIGHEST delineate the list position numbers on
entry, and they update the variables and adjust the
NLP and NRP values accordingly. The CAR is found
by setting the HIGHEST variable to the leftmost
atom (i.e. lowest position number) with sufficient
right parentheses to balance the first atom in the
list. The CDR is obtained by finding the same
atom, but setting the LOWEST variable to the next
highest value. The remaining statements adjust
the NLP and NRP counters accordingly.


The CONS function is accomplished by adding
one to the left parentheses count of the elements
of the first argument and then adding the total
NLP, NRP and POSN of the first argument
to the NLP, NRP and POSN of the elements of the
second argument. Figures 6-6a, b and c give an
example. Figure 6-7 gives the algorithm.


LPCNT := 0
RPCNT := 0
J := 1
K := 1

FOR I := 1 TO ENDI DO
  CASE ITEM(I) OF
    LEFTP: LPCNT := LPCNT + 1
    ATOM:
      BEGIN
        LISTNAME(J) := NAMEOFLIST
        ATOM(J) := ITEM(I)
        NLP(J) := LPCNT
        (* DEFINE RANGE OF RP TO INCLUDE 0 *)
        NRP(J-1) := RPCNT
        POSN(J) := K
        K := K + 1
        J := J + 1
      END
    RIGHTP: RPCNT := RPCNT + 1
  END

NRP(J-1) := RPCNT


Figure 6-2 List to EPS Transformation Algorithm


NUM NUM 
LIST LEFT RIGHT POSITION 
NAME ATOM PARAN PARAN NUMBER 
FIELD FIELD FIELD FIELD FIELD 
(NLP) (NRP) (POSN) 


LISTC C 2 0 1 
LISTC D 2 1 2 
LISTC A 2 1 3 
LISTC B 3 2 4 
LISTC E 3 3 5 


Figure 6-3 EPS Representation of ((C D) A (B) E) 
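The list-to-EPS transformation can be sketched in Python as follows. Instead of storing NRP one iteration late as in Figure 6-2, the sketch updates the NRP of the most recent atom whenever a right parenthesis is read, which gives the same values; the tokenizer and the dictionary field names are assumptions for the example.

```python
def tokenize(text):
    """Split a list such as '((C D) A)' into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def to_eps(text, name):
    """Return one EPS record per atom of the parenthesized list."""
    records = []
    left = right = 0
    for item in tokenize(text):
        if item == "(":
            left += 1
        elif item == ")":
            right += 1
            if records:
                # Right parentheses directly after an atom count toward it.
                records[-1]["nrp"] = right
        else:
            records.append({"list": name, "atom": item, "nlp": left,
                            "nrp": right, "posn": len(records) + 1})
    return records
```

Applied to ((C D) A (B) E), the function reproduces the NLP, NRP, and POSN columns of Figure 6-3.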


BEGIN (* CAR *)
  IF LOWEST <= POSN(*) AND POSN(*) <= HIGHEST THEN
    HIGHEST := MINIMUM(POSN(*), NLP(*)-1 <= NRP(*))
  IF LOWEST = HIGHEST THEN
    NRP(HIGHEST) := NRP(HIGHEST) - 1
  IF LOWEST <= POSN(*) AND POSN(*) <= HIGHEST THEN
    NLP(*) := NLP(*) - 1
END (* CAR *)


Figure 6-4 CAR Algorithm for EPS


BEGIN (* CDR *) 

LPCNT := NLP(LOWEST) - 1 

IF LOWEST <= POSN(*) AND POSN(*) <= HIGHEST THEN 
LOWEST := MINIMUM(POSN(*) , NLP (*)—1<=NRP(*) ) 

RPCNT := NRP(LOWEST) 

LOWEST := LOWEST + 1 

IF LOWEST <= POSN(*) AND POSN(*) >= HIGHEST THEN 


NLP(*) := NLP(*) — LPCNT 
NRP(*) := NRP(*) - RPCNT 
END 

END. (* CDR *) 


Figure 6-5 CDR Algorithm for EPS 

LIST            NUM LEFT   NUM RIGHT   POSITION 
NAME    ATOM    PAREN      PAREN       NUMBER 
FIELD   FIELD   FIELD      FIELD       FIELD 
                (NLP)      (NRP)       (POSN) 

LISTB   C       1          0           1 
LISTB   D       1          1           2 

Figure 6-6a The EPS Representation of LISTB 


LIST            NUM LEFT   NUM RIGHT   POSITION 
NAME    ATOM    PAREN      PAREN       NUMBER 
FIELD   FIELD   FIELD      FIELD       FIELD 
                (NLP)      (NRP)       (POSN) 

LISTA   A       1          0           1 
LISTA   B       2          1           2 
LISTA   E       2          2           3 


Figure 6-6b The EPS Representation of LISTA 


LIST            NUM LEFT   NUM RIGHT   POSITION 
NAME    ATOM    PAREN      PAREN       NUMBER 
FIELD   FIELD   FIELD      FIELD       FIELD 
                (NLP)      (NRP)       (POSN) 

LISTC   C       2          0           1 
LISTC   D       2          1           2 
LISTC   A       2          1           3 
LISTC   B       3          2           4 
LISTC   E       3          3           5 


Figure 6-6c The EPS Representation of (CONS LISTB LISTA) 


PROCEDURE CONS(LOWESTB, HIGHESTB, LOWESTA, HIGHESTA) 

BEGIN (* CONS *) 
    IF LOWESTB <= POSN(*) AND POSN(*) <= HIGHESTB THEN 
        NLP(*) := NLP(*) + 1 

    IF LOWESTA <= POSN(*) AND POSN(*) <= HIGHESTA THEN 
        BEGIN 
            NLP(*) := NLP(*) + NLP(HIGHESTB) 
            NRP(*) := NRP(*) + NRP(HIGHESTB) 
            POSN(*) := POSN(*) + POSN(HIGHESTB) 
        END 
END (* CONS *) 


Figure 6-7 CONS Algorithm 
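
The CONS step can be sketched in Python as below, using the example of Figures 6-6a to 6-6c. This is a sketch under stated assumptions: rows are (atom, NLP, NRP, POSN) tuples, the function name cons is hypothetical, and the totals taken from the last element of the first argument are read before its left parenthesis count is incremented, which is the reading that reproduces Figure 6-6c.

```python
LISTB = [("C", 1, 0, 1), ("D", 1, 1, 2)]                   # (C D)
LISTA = [("A", 1, 0, 1), ("B", 2, 1, 2), ("E", 2, 2, 3)]   # (A (B) E)


def cons(first, second):
    """Combine two EPS row lists into the rows of (CONS first second)."""
    _, lp_total, rp_total, posn_total = first[-1]   # totals of first argument
    out = []
    for atom, nlp, nrp, posn in first:
        # One extra left parenthesis now precedes every element.
        out.append((atom, nlp + 1, nrp, posn))
    for atom, nlp, nrp, posn in second:
        out.append((atom, nlp + lp_total, nrp + rp_total, posn + posn_total))
    return out


for row in cons(LISTB, LISTA):
    print(row)
```

With these inputs the resulting rows carry the same NLP, NRP and POSN values as LISTC in Figure 6-6c.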


Conclusion 


Three different techniques of storing logical 
list structures in associative devices for 
efficient processing are described. Preliminary 
analysis indicates that considerable speedup in 
pattern matching can be achieved if conventional 
LISP and PROLOG implementations were to be 
implemented. While this avenue should be 
explored, it would appear that even greater speeds 
are possible by implementing the searching 
functions inherent in these languages at a more 
direct level. 

DETERMINATION OF THE ROTATIONAL AND TRANSLATIONAL COMPONENTS OF A 
FLOW FIELD USING A CONTENT ADDRESSABLE PARALLEL PROCESSOR 


M. E. Steenstrup, D. T. Lawton, C. Weems 


Department of Computer and Information Science¹ 
University of Massachusetts at Amherst 


Abstract 


The realization of motion perception in artificial 
systems will require highly parallel architectures. Here 
we demonstrate the use of a Content Addressable Parallel 
Processor (CAPP) [1,2] as an effective means of quickly 
and accurately decomposing a flow field into its rotational 
and translational components [3] to recover the parameters 
of sensor motion. 


Organization of the CAPP 


The CAPP is a VLSI-based Single Instruction Multiple 
Data (SIMD) machine designed at the University of 
Massachusetts [4]. It consists of a parallel processor 
containing 512x512 cells and a central controller. The 
central controller issues instructions to the parallel 
processor, controls loading and unloading of data in the 
parallel processor, and serves as an interface to the host 
computer and to secondary storage devices. It broadcasts 
data to the parallel processor bit serially, and the entire 
memory may be bulk-loaded in one video frame time 
(1/30 second). The central controller contains a set of 
micro-coded subroutines in ROM for performing high-level 
CAPP routines and a writeable control store for adding 
microcode. 


The parallel processor consists of an 8x8 array of 
processing circuit boards and a set of boards which 
control CAPP edge treatment. Each processor board, in 
turn, consists of an 8x8 array of special purpose CAPP 
IC chips plus random buffer logic. Each chip then 
contains 64 cells, an instruction decoder, and some 
miscellaneous logic. There are eight basic instruction 
types recognized by the chip, each performed in parallel 
by the constituent cells. Most instructions take one minor 
cycle time (100 nanoseconds) to execute. Inter-cell 
communication is bit serial and is accomplished by a 
four-way (N, S, E, W) cell interconnect network, allowing 
for three types of edge treatments: dead-edging, circular 
wrap, and zig-zag wrap. 


1. This research was supported by DARPA under Grant 
N00014-82-K-0464. 

Each unit cell consists of 64 bits of fully static 
memory, four one-bit static "tag" registers A, B, X, and 
Y, a static carry bit register Z, and an ALU which 
continuously generates X NAND Y, X NOR Y, and X 
+ Y + Z. Also, each cell contains logic for selecting a 
data source (a register (excluding Z), memory, an ALU 
function, broadcast data, or a neighboring cell (N, S, E, 
or W)), possibly inverting the selected signal, and storing 
it in a destination (a register or memory). The X 
register is the main tag register. Its output is connected 
to Some/None logic, indicating cell response, and to the 
neighbor communication network. The A register controls 
whether or not a cell is active. An inactive cell ignores 
all but a small set of instructions broadcast by the central 
controller. The Y register provides a secondary store for 
tag bits, while the B register provides a secondary store 
for activity bits. 


Flow Field Decomposition Procedure 


Our algorithm is an exhaustive search procedure which 
uses a set of rotational and translational flow field 
templates to find a component pair which can account 
for the motion depicted in a given flow field. Currently, 
1000 rotational templates and 200 translational templates 
are used. These are generated from 100 axes which are 
uniformly distributed with respect to a unit hemisphere, 
and all pass through the origin of the sensor coordinate 
system. Each flow field consists of 16x16 vectors and is 
stored on a 2x2 square of chips consisting of 256 cells. 
The 2x2 chip arrangement facilitates flow field addressing. 
Each cell contains the horizontal and vertical components 
of a flow vector, each specified with 10 bits of precision. 


The algorithm consists of four basic steps. 


(0) The rotational templates are loaded into the CAPP, 
one template for each flow field location. Each flow 
field location corresponds to one of the squares in the 
CAPP diagrams shown in Figures 2a, 2b, and 2c. The 
rotational templates need only be loaded once since they 
are used in determining any flow field decomposition. 


(1) A copy of the input flow field is loaded into each 
flow field location in the CAPP. Figures 1a and 1b show 
two sample input fields, both produced by the same 
motion and environment, except that Figure 1b was 
produced by adding random spike noise to Figure 1a. 


(2) A set of difference fields is formed by subtracting 
each rotational template from the copy of the input flow 
field stored with it. For each resulting difference field, 
the slope of each difference vector is computed by 
dividing the vertical component by the horizontal 
component. These subtraction and division procedures are 
performed in parallel across all flow fields represented in 
the CAPP. 


(3) The similarity between the difference fields and 
each of the translational templates is evaluated, 
proceeding sequentially through all the translational 
templates. For a given translational template, this 
comparison is done in parallel with all difference fields 
stored in the CAPP and consists of the following steps: 


(3a) The slope of each component vector of the 
translational template is loaded into the corresponding 
vector location of each difference field. The sign of the 
slope of each difference vector is compared with the sign 
of the slope of the corresponding translational template 
vector. If the signs agree, the absolute value of the 
difference between the slope of the difference vector and 
the slope of the translational template vector is computed, 
and then scaled according to the absolute value of both 
slopes. If the scaled slope difference does not exceed a 
predetermined maximum error value, then a vector match 
is designated at that position. The quantity of error 
permitted here allows the algorithm to be resistant to 
uniformly distributed Gaussian noise of low variance 
present in the original flow field. 


(3b) For each difference field the number of vector 
slope matches is counted. If this sum exceeds a 
predetermined minimum number of matches (in our 
implementation, 75% of the field size), then the associated 
rotational and translational templates become a candidate 
pair for the flow field decomposition. Utilization of a 
minimum number of required matches ensures that only 
templates which are reasonably close to the actual motion 
will be chosen and permits some resistance to random 
spike noise. Figure 2a shows, for difference fields 
resulting from the input field in Figure 1a, the CAPP 
response to the translational template which is closest to 
the actual translational motion. Each black dot within a 
square represents a position in a difference field at which 
the slope of the difference vector matches the slope of 
the translational template. Figure 2b shows, for 
difference fields resulting from the input field in Figure 
1b, the CAPP response to the translational template 
which is closest to the actual translational motion. Figure 
2c shows the CAPP response to a translational template 
which is not close to the actual translational motion. 
This translational template is shown in Figure 3. 


(3c) For all difference fields yielding at least the 
required minimum number of matches, the variance of 
the scaled slope difference is computed, and the 
difference field with the minimum variance is determined. 
This value is compared to the minimum variance found 
from processing the preceding translational templates. If 
this value is less than the preceding minimum, it becomes 
the new global minimum, and the rotational template 
associated with the difference field together with the 
current translational template become the current best 
candidate pair for the flow field decomposition. 


Steps 3a, 3b, and 3c are performed for each 
translational template. 


(4) The flow field decomposition considered to be the 
best is the rotational and translational template pair 
resulting in the difference field yielding at least the 
required minimum number of matches and the least slope 
difference variance. Utilizing minimum variance instead 
of the maximum number of matches, the algorithm has 
achieved better results, particularly for motions whose 
component parts lie between sets of templates. Figures 
4a and 4b show the rotational and translational templates 
selected by the algorithm in the presence of and in the 
absence of noise, for the input fields in Figures 1a and 
1b. These templates are the closest ones to the actual 
motions. Figures 5a and 5b show the difference fields 
resulting from subtracting the rotational motion in 4a 
from the original fields in Figures 1a and 1b respectively. 
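

The search in steps (2) through (4) can be illustrated with a small sequential Python sketch. It is not the CAPP microcode: the loop over rotational templates stands in for the parallel comparison, and several details are assumptions because the text does not give them, namely the scaling |a - b| / (|a| + |b|), the treatment of vectors with a zero horizontal component as non-matches, and the use of the mean squared scaled difference over matching vectors as the variance. The names slope, scaled_error and decompose are hypothetical.

```python
def slope(vec):
    """Vertical over horizontal component; None if horizontal part is zero."""
    u, v = vec
    return v / u if u != 0 else None


def scaled_error(a, b):
    """Assumed scaling of the slope difference by the size of both slopes."""
    denom = abs(a) + abs(b)
    return 0.0 if denom == 0 else abs(a - b) / denom


def decompose(flow, rot_templates, trans_templates, max_err=0.1, min_frac=0.75):
    """Return (rotational index, translational index) of the best pair, or None."""
    best, best_var = None, float("inf")
    for t_idx, trans in enumerate(trans_templates):       # step 3, sequential
        t_slopes = [slope(v) for v in trans]
        for r_idx, rot in enumerate(rot_templates):       # parallel on the CAPP
            errors = []
            for (fu, fv), (ru, rv), ts in zip(flow, rot, t_slopes):
                ds = slope((fu - ru, fv - rv))            # step 2
                if ds is None or ts is None or ds * ts < 0:
                    continue                              # signs disagree
                err = scaled_error(ds, ts)                # step 3a
                if err <= max_err:
                    errors.append(err)
            if not errors or len(errors) < min_frac * len(flow):
                continue                                  # step 3b
            var = sum(e * e for e in errors) / len(errors)   # step 3c
            if var < best_var:
                best, best_var = (r_idx, t_idx), var
    return best                                           # step 4


rot = [[(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)],
       [(0.5, 0.5)] * 4]
trans = [[(1.0, 1.0), (1.0, -1.0), (2.0, 1.0), (1.0, 2.0)],
         [(1.0, 3.0), (1.0, -3.0), (1.0, 3.0), (1.0, -3.0)]]
depth = [2.0, 0.5, 1.0, 3.0]      # per-vector scale of the translational part
flow = [(r[0] + d * t[0], r[1] + d * t[1])
        for r, t, d in zip(rot[1], trans[0], depth)]
print(decompose(flow, rot, trans))
```

The toy input is built from the second rotational template plus a depth-scaled copy of the first translational template, so the search is expected to return that pair of indices.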


Experiments 


Experiments have been performed with a CAPP 
simulator on a VAX 11/780 using a wide variety of 
motions and simulated environments. In all cases 
examined, the translational template closest to the actual 
translational motion was selected. The rotational template 
was always close to the actual rotational motion, but was 
sometimes not the closest template. The procedure 
proved to be resistant to limited Gaussian noise as well 
as to limited random spike noise in the original flow 
field. Applying motion to points at random depths 
produced results similar to those obtained in the noise 
experiments. The algorithm’s performance degraded 
slightly if each flow vector component was specified by 
eight bits of precision instead of by ten. 


The CAPP timing calculations revealed that the 
algorithm could perform the rotational-translational 
decomposition in slightly more than 1/4 second. If two 
CAPPs are used in parallel, then the time can be 
reduced to less than 1/5 second, since only half of the 
translational templates need be tested on each CAPP. 
Given fabrication techniques available in the immediate 
future, we expect execution times to be significantly 
improved. We also suspect that performance will improve 
by increasing both the number and size of the rotational 
and translational templates. This amounts to utilizing 
more CAPPs in parallel. 


ABSTRACT 


The reliability analysis of computer communi- 
cation networks is generally based on Boolean alge- 
bra and elementary probability theory. Several in- 
teresting reliability problems of computer networks 
include terminal-pair connectivity, tree connectivity, 
and multi-terminal connectivity. Traditionally, 
attempts were made to compute only the point and 
average reliabilities for networks because of the 
computational complexity involved. In this paper, 
an attempt is made to perform the dynamic 
analysis of reliability problems of computer net- 
works. Time-dependent expressions for reliability 
measures are derived assuming Markovian behavior 
for failures and repairs. The advantages of the pro- 
posed methods are: the provision for incorporation 
of different distributions for failure and recovery 
times, computation of task and mission related 
measures such as Mean Time to First Failure (MTFF) 
and Mean Time Between Failures (MTBF), and network 
design based on the dynamic behavior. The 
advantages of dynamic reliability analysis are illustrated 
by a detailed study of the bridge network. 


1. INTRODUCTION 


With the increased interest in resource sharing, 
computer networking and distributed processing 
are becoming increasingly popular [ROBE 70]. 
Computer networks are being employed in many 
different applications such as distributed process 
control, electronic banking, defense systems, etc. 
Since such applications usually demand very reli- 
able operation, redundant computers and commun- 
ication links are incorporated in the design of the 
computer networks. Reliability modeling and 
analysis of such computer communication networks 
has drawn the attention of many researchers for a 
number of years [HANS 72, FRAT 73, GRNA 79]. 

The reliability models used for computer 
communication networks were based on only one of 
the following techniques: discrete-state Markov 
chains, Boolean algebra, or simple probability theory 
[FRAT 73, GRNA 80a]. Most of the studies were 
concerned with computation of only the point and 
steady-state reliabilities for networks because of 
the computational complexity involved [BALL 80]. 
There seem to be no analytic methods reported in 
the literature for studying the dynamic behavior of 
reliability problems of computer networks. 


Consider the computer communication net- 
work shown in Figure 1. In such a computer net- 
work, there are several reliability problems. The 
probabilistic events of interest are: 


• Terminal-pair connectivity 
• Tree (broadcast) connectivity 
• Multi-terminal connectivity 


These reliability problems depend on the network 
topology, distribution of resources, operating environment, 
and the probability of failures of computing 
nodes and communication links. The computation 
of the reliability measures for these events 
requires the enumeration of all the paths between 
the chosen set of nodes. The complexity of these 
problems, therefore, increases very rapidly with 
network size and topological connectivity. 


Terminal-pair connectivity is useful because 
most applications of computer networks require 
connection between two nodes over a period of 
time; for example, in remote interactive computing. 
Several methods have been proposed in the literature 
to analyze reliability of terminal-pair connectivity 
[FRAT 73, FRAT 76, FRAT 78, GRNA 79, GRNA 
80a, HANS 72, HANS 74]. The tree connectivity 
problem has not been dealt with in the literature, 
and is useful in studying the reliability of successful 
broadcasting of information by a central controller 
to a set of nodes in the network. 


In a distributed processing system where 
resources such as files and programs are distribut- 
ed among many computers, the successful comple- 
tion of a task generally requires that several sites 
should be up, and communicating with each other. 
Execution of a task may require access to several 
files residing at different sites and communication 
paths between several node pairs. The probability 
of successful execution of a task is therefore more 
complex and useful than the terminal-pair connec- 
tivity in a distributed communications network. To 
handle this problem, we are interested in the event 
multi-terminal connectivity which was introduced 
in [GRNA 81]. The multi-terminal connectivity 
reflects fairly accurately the survivability of distri- 
buted systems with redundant processor, data base 
and communication resources [HILB 80, MERW 80, 
GRNA 81]. 


In this paper, an attempt is made to study 
dynamic or time-dependent analysis of the various 
connectivity problems of computer networks. We 
consider two different operating environments for 
computer networks, namely, closed or non-repairable, 
and repairable. The advantages of 
dynamic reliability analysis are: the provision for in- 
corporation of different distributions for failure and 
recovery times, computation of task and mission re- 
lated measures such as mean time to first failure 
(MTFF) and mean time between failures (MTBF), and 
network design based on the dynamic behavior. 


In section 2 we explain the reliability prob- 
lems of interest, define useful reliability measures, 
and summarize results of previous work. Section 3 
deals with detailed description of the methodology 
proposed for analyzing dynamic behavior of com- 
puter networks. In section 4 we illustrate the 
methodology of dynamic reliability analysis by a de- 
tailed study of the bridge network. 


2. DYNAMIC RELIABILITY ANALYSIS 


For reliability analysis, a computer network 
or a distributed processing system is usually 
represented by a graph G(V,E) where V and E are 
respectively, the set of nodes (representing the 
computers) and the set of directed or undirected 
arcs representing the communication links. The 
number of nodes is $N = |V|$ and the number of links is 
$b = |E|$. The links and nodes are labeled as $x_i$'s and 
$x_{i+b}$'s respectively. The component set of the network 
is given as $C = \{x_1, \ldots, x_b, x_{b+1}, \ldots, x_{b+N}\}$. In 
the static reliability analysis, the processing nodes 
and the communication links are associated with 
reliabilities, i.e., probabilities of being operational. 
The reliability of i-th element is given by, 


$$p_i = \Pr(i\text{-th element is working})$$

$$q_i = 1 - p_i$$

It is generally assumed that there is no correlation 
between failures of different links and nodes, i.e., 
the failures of the elements are statistically 
independent. Further, it is assumed that 
the characterization of individual element failure 
behavior is sufficient to perform reliability analysis. 

Most researchers performing computer network 
reliability analysis assume that the component 
(link or node) reliabilities $p_i$ are constant during the 
time interval in which the reliability of the network 
is being examined. Additionally, no distinctions are 
made between the reliability and availability of 
networks under different operating environments. For 
example, in [FRAT 73], an average network 
terminal-pair reliability is defined and computed 
for a repairable network in which the individual 
components are assumed to be undergoing failure 
and subsequent repair (or recovery). The expression 
derived for this measure in actuality 
represents a steady-state terminal-pair availability. 


In this section, we will attempt to clarify the 
terminology used in the network reliability problems 
by giving precise definitions of the measures. 
The events whose probability of success is of in- 
terest in a network are: terminal-pair connectivity, 
broadcast connectivity, multi-terminal connectivity, 
etc. Occasionally, these events are also required to 
satisfy some performance constraints specified by 
the user. The most common constraints include de- 
lay (time delay or hop length), flow (capacity or 
throughput), and survivability of distributed pro- 
grams and data. 


The network may be operated as a closed sys- 
tem, i.e., no repair of failed elements (nodes and 
links) is possible during the time interval of in- 
terest. If the failed network elements are repaired 
and made operational while the rest of the network 
may be still providing acceptable level of service, 
we are interested in the gain in the probability of 
successful completion of the events defined above. 


We can define two representative states taken 
by each component and also by the system (i.e., the 
network): operating and failed. The component 
failure behavior is simple to understand. The system 
is said to be failed if at any time instant it cannot 
maintain a specified level of service (an event 
under some constraints). The dynamic reliability 
measures which are found very useful in the design 
and evaluation of computer networks and distribut- 
ed systems are defined below (these measures are 
defined for computer systems in books on reliabili- 
ty) [SIEW 82]. 


Reliability: Given that the network was fully opera- 
tional (all the components operating) at time t=0, 
the reliability of the system (R(t)) is the probability 
that it continues to provide the specified level of 
service at time t=T. There may be many failures of 
components, but the network remains operational 
throughout the interval [0,T]. 


Mean Time to First Failure: The MTFF of the net- 
work is the average time it takes for the network to 
enter the failed state (i.e., failure to satisfy the 
specified service request) for the first time, given 
that it was fully operational at time t=0. In the con- 
text of computer networks, an example of service 
for which MTFF is of interest is file transfer between 
a source node and a destination node. 


Note that definitions of R(t) and MTFF apply to both 
closed and repairable networks. 


Availability: Given that the network was initially (at 
time t=0) working in full configuration, the availa- 
bility (A(t)) is the probability that the network is 
successful in providing the specified level of service 
(completing an event under a constraint) at any 
time instant t=T. The network might have undergone 
one or more failures during the interval (0,T), 
but it was made operational again by repairing or 
replacing the failed elements. 


Mean Time Between Failures: For repairable net- 
works, MTBF is the average time between two sys- 
tem level failures. MTBF for the network will be 
higher than that of a single component. 


Steady State Availability: The equilibrium or steady 
state availability (SA) of a network gives the long 
term probability of maintaining the specified level 
of service given that the repair is provided on 
demand throughout the lifetime of the network. It 
is a measure of the fraction of time the communication 
system is able to exchange information 
between a set of nodes. 


Figure 1. A Typical Computer Network 


Several methods are reported in the literature 
for terminal reliability analysis and computation 
using a Boolean algebraic approach [FRAT 79, 
FRAT 76, FRAT 78, GRNA 79]. These methods start 
by considering all the simple paths between a given 
pair of nodes, and then performing some Boolean 
operations to arrive at the Boolean expression for 
the probabilistic event of interest. This expression 
is then used to obtain a terminal reliability expression 
by using the corresponding network element 
reliabilities. As an example, in the computer network 
shown in Figure 1, there exist three different 
paths between the source S and the destination T. 
We assume that the nodes are perfectly reliable and 
write the paths in terms of the links. The paths are: 
1) $x_1 x_2 x_3$, 2) $x_4 x_5 x_6$, and 3) $x_1 x_7 x_6$. The terminal 
reliability between S and T is given by the probability of 
the event 


P(path 1 up) + P(path 1 down)*P(path 2 up) 
+ P(paths 1 & 2 down)*P(path 3 up) 


The Boolean expression for this event is given by, 


$$x_1 x_2 x_3 + \overline{x_1 x_2 x_3}\, x_4 x_5 x_6 + \overline{x_2 x_3}\;\overline{x_4 x_5}\, x_1 x_6 x_7$$

A highly efficient algorithm for terminal reliability 
analysis has been proposed by Grnarov et al. 
[GRNA 79, GRNA 80a]. This algorithm for symbolic 
reliability analysis is based on the representation of 
simple paths by “cubes” (instead of prime impli- 
cants). The algorithm introduces a new operation 
for manipulating the cubes, and the interpretation 
of resulting cubes in such a way that Boolean and 
arithmetic reduction are combined. The details of 
the derivation of the algorithm can be found in 
[GRNA 80a]. The symbolic terminal reliability expression 
for the above example can be obtained as: 


$$R(S \to T) = p_1 p_2 p_3 + (1 - p_1 p_2 p_3)\, p_4 p_5 p_6 + (1 - p_2 p_3)(1 - p_4 p_5)\, p_1 p_6 p_7 \qquad (1)$$

where $S \to T$ represents S-T connectivity. 
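

As a check on Equation 1, the following Python sketch compares it with a brute-force enumeration of all link states. It assumes the three S-T paths read from the text (links 1-2-3, 4-5-6 and 1-7-6), perfectly reliable nodes, and independent link failures; the function names and the sample probabilities are illustrative only.

```python
from itertools import product

PATHS = [(1, 2, 3), (4, 5, 6), (1, 6, 7)]   # S-T paths, by link index


def enumerate_reliability(p):
    """Sum the probabilities of all link states in which some path is up."""
    total = 0.0
    for state in product((False, True), repeat=7):
        up = {i + 1 for i, ok in enumerate(state) if ok}
        if any(up.issuperset(path) for path in PATHS):
            prob = 1.0
            for i, ok in enumerate(state):
                prob *= p[i + 1] if ok else 1.0 - p[i + 1]
            total += prob
    return total


def equation_1(p):
    """Symbolic terminal reliability expression, Equation 1."""
    return (p[1] * p[2] * p[3]
            + (1 - p[1] * p[2] * p[3]) * p[4] * p[5] * p[6]
            + (1 - p[2] * p[3]) * (1 - p[4] * p[5]) * p[1] * p[6] * p[7])


p = {i: 0.80 + 0.02 * i for i in range(1, 8)}   # arbitrary sample values
print(enumerate_reliability(p), equation_1(p))
```

The two values agree to within rounding error, which is what the disjoint-term form of Equation 1 predicts; the enumeration visits 2^7 states, whereas the symbolic expression needs only three terms.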


In a distributed processing system with 
redundant resources, we will be interested in 
multi-terminal connectivity, which is needed when 
running a program at a site that requires files residing 
at different sites [HILB 80, MERW 80, GRNA 81]. 
In [GRNA 81] an efficient algorithm was proposed for 
multi-terminal reliability analysis and is based on 
the derivation of a Boolean function for multi-terminal 
connectivity. This algorithm is an extension 
of the algorithm presented in [GRNA 80a, GRNA 
80b] to handle both reliability computation and 
symbolic reliability analysis. 


Another reliability measure of interest in 
studying computer networks is tree connectivity. 
The probabilistic event of interest is that there ex- 
ists at least one path from a particular node to a set 
of nodes. This is useful in studying the reliability of 
broadcasting of information from a given node to a 
set of nodes. For example, referring to Figure 1, we 
might like to evaluate the reliability of connection 
from node 8 to nodes 11, 12, and 13 at the same 
time. The evaluation is more than a simple multipli- 
cation of terminal reliabilities, since there are some 
dependencies. Actually, we can perform a Boolean 
AND operation on the set of paths for each node 
pair and obtain the Boolean expression for the tree 
connectivity. The source S can successfully broadcast 
information to nodes 11, 12, and 13 if all of the 
following events are true: 1) $x_1 x_2$, 2) $x_1 x_7 + x_4 x_5$, and 
3) $x_1 x_2 x_3 + x_1 x_7 x_6 + x_4 x_5 x_6$. 


We first perform the Boolean AND operation on 
these three sets of paths: 


$$(x_1 x_2)(x_1 x_7 + x_4 x_5)(x_1 x_2 x_3 + x_1 x_7 x_6 + x_4 x_5 x_6)$$

After simplifying, the Boolean expression becomes: 


$$x_1 x_2 x_3 x_7 + x_1 x_2 x_6 x_7 + x_1 x_2 x_3 x_4 x_5 + x_1 x_2 x_4 x_5 x_6$$


The corresponding reliability expression can then 
be obtained by using Grnarov's algorithm as: 


$$R(S \to \{11,12,13\}) = p_1 p_2 p_3 p_7 + (1 - p_3)\, p_1 p_2 p_6 p_7 + (1 - p_7)\, p_1 p_2 p_3 p_4 p_5 + (1 - p_3)(1 - p_7)\, p_1 p_2 p_4 p_5 p_6 \qquad (2)$$
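

Equation 2 can be checked in the same brute-force manner. The sketch below assumes the path sets read from the text for nodes 11, 12 and 13, perfectly reliable nodes, and independent link failures; the names and sample probabilities are illustrative.

```python
from itertools import product

# Path sets (by link index) from the source to nodes 11, 12 and 13.
EVENTS = [
    [(1, 2)],
    [(1, 7), (4, 5)],
    [(1, 2, 3), (1, 7, 6), (4, 5, 6)],
]


def broadcast_ok(up):
    """True if every destination has at least one path with all links up."""
    return all(any(up.issuperset(path) for path in paths) for paths in EVENTS)


def enumerate_tree_reliability(p):
    total = 0.0
    for state in product((False, True), repeat=7):
        up = {i + 1 for i, ok in enumerate(state) if ok}
        if broadcast_ok(up):
            prob = 1.0
            for i, ok in enumerate(state):
                prob *= p[i + 1] if ok else 1.0 - p[i + 1]
            total += prob
    return total


def equation_2(p):
    """Symbolic tree-connectivity reliability expression, Equation 2."""
    return (p[1] * p[2] * p[3] * p[7]
            + (1 - p[3]) * p[1] * p[2] * p[6] * p[7]
            + (1 - p[7]) * p[1] * p[2] * p[3] * p[4] * p[5]
            + (1 - p[3]) * (1 - p[7]) * p[1] * p[2] * p[4] * p[5] * p[6])


p = {i: 0.80 + 0.02 * i for i in range(1, 8)}   # arbitrary sample values
print(enumerate_tree_reliability(p), equation_2(p))
```

Because the enumeration takes the AND of the three events directly, agreement with Equation 2 also confirms that the result is not the simple product of the three terminal reliabilities.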


In the next section, we show the method of 
deriving the time-dependent reliability and availa- 
bility expressions by the extension of the Boolean 
approaches discussed above. This method of 
dynamic analysis will be shown to be equivalent to 
the analysis of the Markov modeling approach used 
to study fault-tolerant computers. 


3. DESCRIPTION OF THE METHODOLOGY 


The following notions are important to keep in 
mind when performing the dynamic reliability 
analysis. First, the network reliability measures 
should always be qualified with i) the probabilistic 
event for which the measures are evaluated, ii) per- 
formance constraints, and iii) the time interval. 
For example, the reliability and availability are 
denoted as: 


R(event, constraint, time) 
A(event, constraint, time) 


The first argument should always be explicit. The 
second argument may be absent if no performance 
constraints are specified by the user. The third ar- 
gument is not needed for the following measures. 


MTFF(event, constraint) 
MTBF(event, constraint) 
SA(event, constraint) 


For example, we may be interested in finding time T 
for which the availability of terminal-pair connectivity 
is greater than, say, $A_0$. To include a performance 
constraint, we may want to find the 
terminal-pair reliability R(t) such that the message 
delay between source and destination is less than 
$d_0$. 


The second notion is to describe more realistically 
the reliability behavior of the individual network 
elements. A single probability of success $p_i$ 
for the i-th element is inadequate. In addition to the 
topological parameters such as the connection ma- 
trix for the network, we need the failure rates and 
repair rates for the nodes and links. Under the Pois- 
son assumptions for the arrivals of failures and ex- 
ponential distribution for the repair times, the reli- 
ability and availability of i-th element can be ex- 
pressed as 


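$$R(x_i, t) = e^{-\lambda_i t}$$

$$A(x_i, t) = \frac{\mu_i}{\lambda_i + \mu_i} + \frac{\lambda_i}{\lambda_i + \mu_i}\, e^{-(\lambda_i + \mu_i) t}$$


where $\lambda_i$ and $\mu_i$ are the constant failure rate and 
repair rate, respectively, of the i-th element (the Mean 
Time To Repair, MTTR, is $1/\mu_i$). The expressions 
for MTFF, MTBF, and SA are simple functions of $\lambda_i$ 
and $\mu_i$: 


$$\mathrm{MTFF}(x_i) = \int_0^\infty R(x_i, t)\, dt = \frac{1}{\lambda_i}$$

$$\mathrm{SA}(x_i) = A(x_i, \infty) = \frac{\mu_i}{\lambda_i + \mu_i} = \frac{\mathrm{MTFF}}{\mathrm{MTFF} + \mathrm{MTTR}}$$

$$\mathrm{MTBF}(x_i) = \mathrm{MTFF} + \mathrm{MTTR} = \frac{1}{\lambda_i} + \frac{1}{\mu_i}$$
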
If the element failures and repairs are described by 
general probability distribution functions, we have 
to resort to Laplace transform techniques to solve 
for the reliability measures of network elements. 
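

For the exponential case, the element-level measures can be written down directly. The Python sketch below is an illustration under the constant-rate assumptions stated above; the function names and the sample rates (per hour) are hypothetical.

```python
import math


def reliability(lam, t):
    """R(x_i, t) for a constant failure rate lam."""
    return math.exp(-lam * t)


def availability(lam, mu, t):
    """A(x_i, t) for constant failure rate lam and repair rate mu."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)


def mtff(lam):
    return 1.0 / lam


def mttr(mu):
    return 1.0 / mu


def steady_state_availability(lam, mu):
    return mu / (lam + mu)


def mtbf(lam, mu):
    return mtff(lam) + mttr(mu)


lam, mu = 0.01, 0.5       # sample values: one failure per 100 h, 2 h repairs
for t in (0.0, 1.0, 10.0, 100.0):
    print(t, reliability(lam, t), availability(lam, mu, t))
print(mtff(lam), mtbf(lam, mu), steady_state_availability(lam, mu))
```

The availability starts at one, decays toward the steady-state value, and that limit equals MTFF / (MTFF + MTTR), whereas the reliability of a single element keeps decaying toward zero.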


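The decomposed reliability model of a network 
obtained by applying the Boolean algebra rules 
on the path sets for a given event can now be 
transformed into a time-dependent model by 
substituting $p_i$ with $R(x_i,t)$ for a non-repairable 
network, or with $A(x_i,t)$ for a repairable network. 
Therefore, for the non-repairable network of Figure 
1, referring to Equation 1, the time-dependent 
reliability can be written as: 


$$
\begin{aligned}
R(S \to T, t) &= e^{-(\lambda_1+\lambda_2+\lambda_3)t} + \left(1 - e^{-(\lambda_1+\lambda_2+\lambda_3)t}\right) e^{-(\lambda_4+\lambda_5+\lambda_6)t} \\
&\quad + \left(1 - e^{-(\lambda_2+\lambda_3)t}\right)\left(1 - e^{-(\lambda_4+\lambda_5)t}\right) e^{-(\lambda_1+\lambda_6+\lambda_7)t} \\
&= e^{-(\lambda_1+\lambda_2+\lambda_3)t} + e^{-(\lambda_4+\lambda_5+\lambda_6)t} + e^{-(\lambda_1+\lambda_6+\lambda_7)t} \\
&\quad - e^{-(\lambda_1+\lambda_2+\lambda_3+\lambda_6+\lambda_7)t} - e^{-(\lambda_1+\lambda_4+\lambda_5+\lambda_6+\lambda_7)t} \\
&\quad - e^{-\left(\sum_{i=1}^{6}\lambda_i\right)t} + e^{-\left(\sum_{i=1}^{7}\lambda_i\right)t}
\end{aligned}
\qquad (3)
$$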
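

The substitution that leads to Equation 3 can be verified numerically. The following Python sketch, with hypothetical function names and arbitrary sample failure rates, evaluates both the substituted form of Equation 1 and the expanded seven-term polynomial for a non-repairable network.

```python
import math


def substituted_form(lam, t):
    """Equation 1 with p_i replaced by exp(-lam_i * t)."""
    p = {i: math.exp(-lam[i] * t) for i in lam}
    return (p[1] * p[2] * p[3]
            + (1 - p[1] * p[2] * p[3]) * p[4] * p[5] * p[6]
            + (1 - p[2] * p[3]) * (1 - p[4] * p[5]) * p[1] * p[6] * p[7])


def expanded_form(lam, t):
    """Seven-term polynomial: one signed exponential per set of links."""
    terms = [
        (+1, (1, 2, 3)),
        (+1, (4, 5, 6)),
        (+1, (1, 6, 7)),
        (-1, (1, 2, 3, 6, 7)),
        (-1, (1, 4, 5, 6, 7)),
        (-1, (1, 2, 3, 4, 5, 6)),
        (+1, (1, 2, 3, 4, 5, 6, 7)),
    ]
    return sum(sign * math.exp(-sum(lam[i] for i in links) * t)
               for sign, links in terms)


lam = {i: 0.001 * i for i in range(1, 8)}   # arbitrary sample failure rates
for t in (0.0, 10.0, 100.0, 1000.0):
    print(t, substituted_form(lam, t), expanded_form(lam, t))
```

Both forms give a reliability of one at t = 0 and agree at later times to within rounding error.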


Figure 2. Markov State Transition-Rate Diagram 


It can be shown that the above equation is exact and 
that the analysis is equivalent to the Markov 
reliability analysis technique. A Markov state 
transition-rate diagram, shown in Figure 2, can be 
developed for the network of Figure 1. Figure 2 
shows the states in which the network satisfies the 
S-T connection and one failure state (state #8). The 
transitions between the states represent failure of 
links. One interesting point to be noted here is that 
the failure of certain links eliminates some links in 
series (e.g., $x_2$ eliminates $x_3$). The failed elements do 
not contribute to either the reliability or unreliability 
of the network. The network reliability is given 
as the sum of the operational state probabilities. 
The state probabilities are obtained by solving the 
following set of linear differential equations: 


$$\dot{P}(t) = Q\, P(t) \qquad (4)$$

where $Q$ is the state transition-rate matrix and 
$P = (P_1, P_2, P_3, P_4, P_5, P_6, P_7)$. The state probabilities for 
each of the operational states in Figure 2 can be 
obtained by solving Equation 4 using Laplace 
transforms. 

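$$P_1(s) = \frac{1}{s + \sum_{i=1}^{7}\lambda_i}$$

$$P_2(s) = \frac{(\lambda_2+\lambda_3)\, P_1(s)}{s + \lambda_1 + \lambda_4 + \lambda_5 + \lambda_6 + \lambda_7}$$

$$P_3(s) = \frac{(\lambda_4+\lambda_5)\, P_1(s)}{s + \lambda_1 + \lambda_2 + \lambda_3 + \lambda_6 + \lambda_7}$$

$$\vdots$$

$$
\begin{aligned}
R(S \to T, t) &= P_1(t) + P_2(t) + P_3(t) + P_4(t) + P_5(t) + P_6(t) + P_7(t) \\
&= e^{-(\lambda_1+\lambda_2+\lambda_3)t} + e^{-(\lambda_4+\lambda_5+\lambda_6)t} + e^{-(\lambda_1+\lambda_6+\lambda_7)t} \\
&\quad - e^{-(\lambda_1+\lambda_2+\lambda_3+\lambda_6+\lambda_7)t} - e^{-(\lambda_1+\lambda_4+\lambda_5+\lambda_6+\lambda_7)t} \\
&\quad - e^{-\left(\sum_{i=1}^{6}\lambda_i\right)t} + e^{-\left(\sum_{i=1}^{7}\lambda_i\right)t}
\end{aligned}
\qquad (5)
$$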


For complex networks, clearly, the Markov 
modeling approach becomes very difficult and 
time-consuming because of the state space explo- 
sion. For each probabilistic event considered, the 
number of states in the Markov model is directly 
proportional to the branching factor (for example, 
x, and x,), existence of cross links (for example, z7), 
and the depth of the network (for example, x), x2, 23 
). When availability is needed, we have to expand 
the state diagram to account for the non- 
homogeneity when the repair rates are different for 
different elements. In view of this drawback of the 
Markov modeling approach, the Boolean algebraic 
approach provides an attractive means to achieve 
both efficiency and functionality. The Boolean 
method can be applied to all the reliability
problems except when the reliability and MTFF are
needed for repairable networks. The reason is that the
element reliabilities do not depend on the
repair rates, whereas the system reliability does. In
general, for the same event and the same
constraint,


$$R_{\text{non-repairable}}(t) \le R_{\text{repairable}}(t) \le A_{\text{repairable}}(t).$$


It is interesting to note that the number of 
terms in the reliability polynomial (Equation 5) is 
the same as the number of operational states in the 
Markov state diagram (Figure 2). Therefore, we can 
extract the state information by fully expanding the
time-dependent reliability expression after
substituting R(i,t) for p_i in the symbolic expression
(Equation 1). The number of operational states does not
increase exponentially with the number of elements
because of the dependencies caused by series
elements.


4. ANALYSIS OF THE BRIDGE NETWORK 


In this section, we perform a detailed dynamic
reliability analysis of the bridge network shown in
Figure 3. This network has five bidirectional links,
marked x1, x2, ..., x5, and four nodes
marked x6, x7, x8, x9. This network is one of the
standard networks analyzed by many researchers. We
first use the efficient Boolean technique explained
in the previous section to obtain a reliability
expression for the event of interest and then translate
it to a time-dependent expression by using the
corresponding element reliability or availability. We
assume that the i-th link has a constant failure rate
$\lambda_i$ and a constant repair rate $\mu_i$.


Figure 3. The Bridge Network 


We first analyze terminal pair connectivity 
between S and T. The terminal reliability expres- 
sion for this event using the Boolean algebraic ap- 
proach [GRNA 80a] is: 


$$R(S \to T) = p_1 p_2 + p_3 p_4 (1 - p_1 p_2) + p_1 p_4 p_5 (1 - p_2)(1 - p_3) + p_2 p_3 p_5 (1 - p_1)(1 - p_4)$$
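
The sum-of-disjoint-products expression above can be checked by brute force. The following is a minimal sketch, not part of the original analysis: it enumerates all 32 link states of the bridge and compares the result with the formula. The link-to-node assignment (x1: S-A, x2: A-T, x3: S-B, x4: B-T, x5: A-B) is an assumption inferred from the path sets used later in this section, and the probabilities in `p` are hypothetical values.

```python
from itertools import product

# Assumed bridge topology (inferred from the path sets in this section):
# x1: S-A, x2: A-T, x3: S-B, x4: B-T, x5: A-B (the bridging link).
EDGES = {1: ("S", "A"), 2: ("A", "T"), 3: ("S", "B"), 4: ("B", "T"), 5: ("A", "B")}


def connected(up, src, dst):
    """Return True if dst is reachable from src using only the links in `up`."""
    seen, frontier = {src}, [src]
    while frontier:
        node = frontier.pop()
        for link in up:
            a, b = EDGES[link]
            for u, v in ((a, b), (b, a)):   # links are bidirectional
                if u == node and v not in seen:
                    seen.add(v)
                    frontier.append(v)
    return dst in seen


def reliability_by_enumeration(p):
    """Sum the probabilities of all link states in which S and T are connected."""
    total = 0.0
    for states in product([0, 1], repeat=5):
        up = [i + 1 for i, s in enumerate(states) if s]
        if connected(up, "S", "T"):
            prob = 1.0
            for i, s in enumerate(states):
                prob *= p[i + 1] if s else 1.0 - p[i + 1]
            total += prob
    return total


def reliability_by_formula(p):
    """Terminal-pair reliability as the sum of four disjoint products."""
    return (p[1] * p[2]
            + p[3] * p[4] * (1 - p[1] * p[2])
            + p[1] * p[4] * p[5] * (1 - p[2]) * (1 - p[3])
            + p[2] * p[3] * p[5] * (1 - p[1]) * (1 - p[4]))


# Hypothetical link reliabilities, chosen to be all different.
p = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6, 5: 0.5}
```

Calling both functions on `p` should give the same value up to rounding, which is a quick way to catch a transcription error in any one of the four product terms.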


Now the time-dependent terminal pair reliability ex- 
pression when no repair is possible is given by, 


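$$
\begin{aligned}
R(S \to T, t) ={}& e^{-(\lambda_1+\lambda_2)t} + e^{-(\lambda_3+\lambda_4)t} + e^{-(\lambda_1+\lambda_4+\lambda_5)t} + e^{-(\lambda_2+\lambda_3+\lambda_5)t} \\
& - e^{-(\lambda_1+\lambda_2+\lambda_3+\lambda_4)t} - e^{-(\lambda_1+\lambda_2+\lambda_3+\lambda_5)t} - e^{-(\lambda_1+\lambda_2+\lambda_4+\lambda_5)t} \\
& - e^{-(\lambda_1+\lambda_3+\lambda_4+\lambda_5)t} - e^{-(\lambda_2+\lambda_3+\lambda_4+\lambda_5)t} + 2e^{-(\lambda_1+\lambda_2+\lambda_3+\lambda_4+\lambda_5)t}
\end{aligned}
$$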

Assuming that all links have the same failure rate $\lambda$,
terminal reliability between S and T becomes:


$$R(S \to T, t) = 2e^{-2\lambda t} + 2e^{-3\lambda t} - 5e^{-4\lambda t} + 2e^{-5\lambda t}$$


We can calculate the mean time to first 
failure as a function of failure rates using the 
definition of Section 2 as: 


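$$
\begin{aligned}
MTFF(S \to T) ={}& \frac{1}{\lambda_1+\lambda_2} + \frac{1}{\lambda_3+\lambda_4} + \frac{1}{\lambda_1+\lambda_4+\lambda_5} + \frac{1}{\lambda_2+\lambda_3+\lambda_5} \\
& - \frac{1}{\lambda_1+\lambda_2+\lambda_3+\lambda_4} - \frac{1}{\lambda_1+\lambda_2+\lambda_3+\lambda_5} - \frac{1}{\lambda_1+\lambda_2+\lambda_4+\lambda_5} \\
& - \frac{1}{\lambda_1+\lambda_3+\lambda_4+\lambda_5} - \frac{1}{\lambda_2+\lambda_3+\lambda_4+\lambda_5} + \frac{2}{\lambda_1+\lambda_2+\lambda_3+\lambda_4+\lambda_5}
\end{aligned}
$$

When all links have the same failure rate we have:

$$MTFF(S \to T) = \frac{1}{\lambda} + \frac{2}{3\lambda} - \frac{5}{4\lambda} + \frac{2}{5\lambda}$$

$$MTFF(S \to T) = \frac{49}{60\lambda}$$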
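
The equal-rate result follows from integrating each exponential term of the reliability expression over [0, ∞), since a term c·exp(-kλt) contributes c/(kλ). A minimal sketch of that arithmetic using exact fractions is given below; the names `ST_TERMS` and `mtff_coefficient` are illustrative and do not come from the paper.

```python
from fractions import Fraction

# Equal-rate S-T reliability written as a sum of c * exp(-k * lam * t):
# exponent multiple k -> coefficient c.
ST_TERMS = {2: 2, 3: 2, 4: -5, 5: 2}


def mtff_coefficient(terms):
    """Integrate sum c*exp(-k*lam*t) over [0, inf).

    Returns x such that MTFF = x / lam.
    """
    return sum(Fraction(c, k) for k, c in terms.items())


st_mtff = mtff_coefficient(ST_TERMS)   # Fraction(49, 60)
```

The coefficients also sum to 1, which confirms that the reliability is 1 at t = 0.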


It is interesting to observe that the MTFF for the S-T
connection is less than that of a simplex link, for
which the MTFF is $1/\lambda$. This is because the
increased hardware complexity of the links involved
in the communication reduces the average time to
failure.


We compute the availability for the event of
terminal pair connectivity between S and T with the
assumption that all links have the same failure rate
$\lambda$ and the same repair rate $\mu$. The time-dependent
terminal-pair availability expression can be obtained
from the reliability expression given above
by substituting the following expression for $p_i$:

$$p_i = \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu}\, e^{-(\lambda+\mu)t}$$

A much simplified expression can be obtained for
steady-state availability by substituting $a = \mu/(\lambda+\mu)$ for
$p_i$ in the reliability expression. That is,


$$SA(S \to T) = 2a^2 + 2a^3 - 5a^4 + 2a^5$$

The mean time between failures is given by the
same expression with $a = 1/\lambda + 1/\mu$.
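
As an illustration of the substitution described above, the sketch below evaluates the time-dependent and steady-state S-T availability for equal link rates. It assumes the standard two-state (up/down) link model with the link initially up; the numerical rates are hypothetical and chosen only for the example.

```python
import math


def link_availability(lam, mu, t):
    """Availability at time t of a single link that starts in the up state."""
    return mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)


def st_polynomial(a):
    """Equal-link S-T polynomial of the bridge network."""
    return 2 * a**2 + 2 * a**3 - 5 * a**4 + 2 * a**5


def st_availability(lam, mu, t):
    """Time-dependent S-T availability: substitute the link availability."""
    return st_polynomial(link_availability(lam, mu, t))


lam, mu = 0.001, 0.1                       # hypothetical failure and repair rates
steady = st_polynomial(mu / (lam + mu))    # steady-state availability
```

At t = 0 the availability is 1, and for large t it converges to the steady-state value `steady`.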
Next we consider a tree connectivity event,
which is the simultaneous connection between S and A,
and S and B. The paths in Boolean terms are:


$$S \to A = x_1 + x_3 x_5 + x_2 x_3 x_4$$


$$S \to B = x_3 + x_1 x_5 + x_1 x_2 x_4$$


We take the logical AND of these two to obtain Boolean
terms for the tree connectivity as:


$$S \to \{A, B\} = x_1 x_3 + x_1 x_5 + x_3 x_5 + x_1 x_2 x_4 + x_2 x_3 x_4$$
From this we obtain the reliability expression for 
this event as, 


$$R(S \to \{A,B\}) = p_1 p_3 + (1-p_3) p_1 p_5 + (1-p_1) p_3 p_5 + (1-p_3)(1-p_5) p_1 p_2 p_4 + (1-p_1)(1-p_5) p_2 p_3 p_4$$


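With the assumption that all links have the same
failure rate $\lambda$, the time-dependent expression for
this event is:


$$R(S \to \{A,B\}, t) = 3e^{-2\lambda t} - 4e^{-4\lambda t} + 2e^{-5\lambda t}$$


The mean time to first failure is then,


$$MTFF(S \to \{A,B\}) = \frac{3}{2\lambda} - \frac{4}{4\lambda} + \frac{2}{5\lambda}$$

which simplifies to:

$$MTFF(S \to \{A,B\}) = \frac{9}{10\lambda}$$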
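
The tree connectivity result can be cross-checked in the same brute-force manner. The sketch below is an illustration under the same assumed link-to-node assignment as before (x1: S-A, x2: A-T, x3: S-B, x4: B-T, x5: A-B). It enumerates all link states with exact fractions, compares the result with the equal-link polynomial 3p^2 - 4p^4 + 2p^5, and recomputes the MTFF coefficient.

```python
from fractions import Fraction
from itertools import product

# Assumed topology, as inferred from the Boolean path terms of this section.
EDGES = {1: ("S", "A"), 2: ("A", "T"), 3: ("S", "B"), 4: ("B", "T"), 5: ("A", "B")}


def reachable(up, src):
    """Return the set of nodes reachable from src over the working links."""
    seen, frontier = {src}, [src]
    while frontier:
        node = frontier.pop()
        for link in up:
            a, b = EDGES[link]
            for u, v in ((a, b), (b, a)):
                if u == node and v not in seen:
                    seen.add(v)
                    frontier.append(v)
    return seen


def tree_reliability(p):
    """Probability that S reaches both A and B when each link is up with probability p."""
    total = Fraction(0)
    for states in product([0, 1], repeat=5):
        up = [i + 1 for i, s in enumerate(states) if s]
        if {"A", "B"} <= reachable(up, "S"):
            k = sum(states)
            total += p**k * (1 - p) ** (5 - k)
    return total


def tree_polynomial(p):
    """Equal-link polynomial for the tree connectivity event."""
    return 3 * p**2 - 4 * p**4 + 2 * p**5


# Term-by-term integration of 3e^{-2 lam t} - 4e^{-4 lam t} + 2e^{-5 lam t}.
tree_mtff = Fraction(3, 2) - Fraction(4, 4) + Fraction(2, 5)
```

Because the arithmetic is exact, agreement at a few distinct values of p is strong evidence that the polynomial matches the assumed topology.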


When repair is available, the steady-state
availability for this tree connectivity event is given
by:


$$SA(S \to \{A,B\}) = 3a^2 - 4a^4 + 2a^5$$


where $a = \mu/(\lambda+\mu)$.


The mean time between failures is given by the
same expression with $a = 1/\lambda + 1/\mu$.


5. SUMMARY


Dynamic reliability modeling and analysis of
computer networks and distributed processing
systems were presented. A systematic method of
obtaining time-dependent reliability expressions for
various events such as terminal-pair connectivity,
tree connectivity, and multi-terminal connectivity
was discussed. The approach uses a well-known
Boolean technique to obtain reliability expressions
and then transforms them into the corresponding
time-dependent expressions. Other important
reliability measures such as availability, MTFF, and
MTBF were also studied. A detailed analysis of the
bridge network was also presented. Further work
involves studying these reliability problems of
computer networks under different performance
constraints such as delay, throughput, and resource
allocation.

Abstract -- A formal Distributed Systems
Design Language (DSDL) is proposed. The language 
design places a very strong emphasis on the 
systematic application of the principle of 
separation of concerns. In DSDL, systems are 
described as nets of communicating processes. The 
language allows designers to define arbitrary 
communication protocols and provides a means for 
protocol encapsulation. DSDL is illustrated by 
means of a highly simplified annotated example 
representative of the nature of the language.

Introduction


This paper reports on one effort to develop a 
formal Distributed Systems Design Language (DSDL). 
In DSDL, systems are described as nets of 
communicating processes. Each process in the net 
has its own local data over which it has sole 
control and procedures that specify primitive and 
indivisible operations over the data. The process 
also has the ability to exchange messages with 
other processes in the net. The behavior of the 
process specifies the order in which its 
procedures are invoked. Sequences of procedures 
are allowed to execute concurrently within the 
process. 


Several considerations have heavily
influenced the nature of DSDL: an emphasis on
formality, a desire to promote the principle of 
separation of concerns, a need to support 
hierarchical specifications, and an aim toward 
generality. Formality is achieved through the use 
of set theoretical models for data representation 
and by employing predicate calculus in defining 
the procedures (using input/output assertions). 
The principle of separation of concerns is 
reflected by the manner in which the definitions 
of the net and of the process are structured; 
they are meant to enhance the designer's ability 
to describe the system in terms of clean 
abstractions. Hierarchical descriptions of the 
system are enabled by the fact that processes may 
be refined into nets. Finally, the generality of 
the language is enhanced by its capacity to 
describe a variety of communication structures and 
protocols. 


DSDL is described below by presenting the
formal model of distributed systems on which it is
based and a small example of its use.

Language Definition


Process Definition 


The process is the basic functional unit of 
DSDL, and is similar to the guardian [1] and the 
monitor [2] (except that internal parallelism is 
allowed). It serves to encapsulate data as in the 
abstract data type [3,4], and has sole access to 
its own data. In addition, a process is able to 
receive and send information via messages as in 
[5]. A set of procedures is defined for the
process; these perform indivisible operations on
data or communicate with the environment. The
behavior of a process is then defined as the set 
of allowable sequences of procedure invocations 
within the process. 


A process p is defined as a five-tuple
    p = (Dp, Tp, Rp, Sp, Bp)
where
    Dp = (Qp, Hp, Ip)
        Dp - process data definition
        Qp - set of data entities
        Hp - data invariant assertion
        Ip - initial value assertion
    Tp = {z | z = (Ain(Dp), Aout(Dp, Dp'))}
        Tp - transformational procedures
        Ain - input assertion
        Aout - output assertion
    Rp = {z | z = ('true', Aout(Dp))}
        Rp - message receiving procedures
    Sp = {z | z = (Ain(Dp), 'true')}
        Sp - message sending procedures
    Bp SUBSET.OF (Tp U Rp U Sp)*
        Bp - process behavior, given as a set
             of procedure invocation sequences


Data. The first element in the 5-tuple 
representing the process p is the data Dp which 
belongs to that process. This in turn is defined 
as a triple (Qp, Hp, Ip). Qp is a set of data 
entities controlled by p and whose elements may be 
accessed only by procedures within p. Hp is the 
data invariant, which is a predicate describing 
the properties which must be possessed by the 
elements of Qp both before and after all data
transformations. The last element, Ip, is a
predicate defining the initial values for each 
data entity in Qp. 


The data controlled by a process appears in 
the "DATA" section of the process definition. 
Both variables and constants may be declared using 
statements whose syntax resembles Pascal. The 
variable declarations are placed side by side with 
predicates that are taken to be parts of the 
invariant Hp (e.g., VAR n: INTEGER; n>0;). Both
variables and constants are either sets or
elements of sets. Some sets are assumed to be
built-in (e.g., INTEGER) while others are
constructed by enumeration (e.g., S={1,2,3}), by
providing an intensional definition (e.g.,
S={z | 0<z<4 AND INTEGER(z)}), or by means of
standard set operations (e.g., union,
intersection, etc.). In addition to sets, a
notation for functions and relations is also
available.


Procedures. Process activities, data
transformations, and message exchanges are defined
by the procedures the process controls. Transformational
procedures, given in terms of input and output 
assertions, describe the state changes which occur 
during process execution. Message exchanges are 
carried out by two built-in procedures (SEND and 
GET) whose semantics are stated in the 
communication section of the net definition. 


The use of nonprocedural specifications in 
defining the meaning of the transformational 
procedures enhances the understandability of the 
process specification. Furthermore, by treating 
procedure invocations as primitive operations over 
the data, the need for synchronization within a 
process is avoided in the same way as it is done 
in the monitor concept of concurrent Pascal [6] 
(but without prohibiting concurrency from 
occurring in the process). 


Syntactically, the definition of 
transformational procedures is straightforward. 
Pairs of input ("IN:") and output ("OUT:") 
assertions are used to cover distinct cases. When 
an input assertion is followed by several output 
assertions a conjunction between them is implied. 
An exception ("EXPT:") assertion may be provided 
to indicate the action to be taken in case all 
input assertions fail (e.g., NIL, RESTART, ABORT,
etc.). Standard predicate calculus is used in
constructing the assertions (AND, OR, XOR, NOT, 
IF-THEN-ELSE, i.e., implication), and a name 
followed by a single quote within an assertion 
denotes the value of a data item after the 
completion of the procedure. 
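
To make the IN/OUT/EXPT convention concrete, the following sketch models a transformational procedure as a list of (input assertion, output assertion) pairs and checks a proposed state change against them. It is only an illustration in Python, not DSDL tooling: the `Procedure` class, its `check` method, and the returned status strings are all hypothetical. The case shown mirrors the `prepare` procedure of the producer process in the Example section, with CHARSTRING(t) approximated by a Python string test.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
InputAssertion = Callable[[State], bool]            # predicate over the old state
OutputAssertion = Callable[[State, State], bool]    # predicate over old and new state


@dataclass
class Procedure:
    """Hypothetical model of a DSDL transformational procedure."""
    cases: List[Tuple[InputAssertion, OutputAssertion]]
    exception: str = "NIL"      # action when every input assertion fails

    def check(self, before: State, after: State) -> str:
        """Classify a state change as 'OK', 'VIOLATION', or the EXPT action."""
        for in_ok, out_ok in self.cases:
            if in_ok(before):
                return "OK" if out_ok(before, after) else "VIOLATION"
        return self.exception


# IN:  count = n AND n > 0
# OUT: w' = (n, t) AND CHARSTRING(t) AND count' = n + 1
prepare = Procedure(cases=[(
    lambda s: s["count"] > 0,
    lambda s, s2: (s2["w"][0] == s["count"]
                   and isinstance(s2["w"][1], str)
                   and s2["count"] == s["count"] + 1),
)])

result = prepare.check({"count": 1}, {"count": 2, "w": (1, "hello")})
```

Here the primed names of the assertion language correspond to the `after` state, and the unprimed names to the `before` state.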


Behavior. The behavior of a process is 
defined as the set of all allowable sequences of 
events within a process, where an event is defined 
as an invocation of a procedure. Constructs 
available for the behavior specification include 
event sequence (BEGIN - END), concurrent event 
sequences (PARBEGIN - PAREND), conditional (IF - 
THEN - ELSE), nondeterministic selection (CASE), 
and repetition (WHILE, DOUNTIL, LOOP). 


Net Definition 


In order to specify a distributed system, the 
concept of a net is included in DSDL. A net is 
defined as a set of independent, concurrent 
processes which communicate among themselves by 
means of messages [2,7,8]. Messages are sent over 
abstract communications paths called links. The 
behavior of these links, along with the behavior 
of each process, determines the behavior of the 
net as a whole.

A net n is defined as a four-tuple

    n = (P, P', L, C)

where

    P = {p | p is a process}

    P' SUBSET.OF P
        P' contains those processes considered
        to be part of the environment.

    L = {l | l SUBSET.OF P}
        where a link l is defined by the set of
        processes that may use it.

    C : L ==> POWERSET.OF ( (UNION OVER ALL p OF
                            (Sp UNION Rp))* )
        with
        C(l) SUBSET.OF (UNION OVER ALL p
                        MEMBER.OF l OF (Sp UNION Rp))*
        i.e., C(l) establishes the link behavior
        which determines the set of allowable
        sequences of send and receive events over
        the link.


Processes. The first two elements in the
4-tuple describing the net are the set of
processes P and a set P' such that P' SUBSET.OF P.
P' contains those processes in P which are
considered to be part of the environment. In
DSDL, these processes are declared with an
"EXTERNAL" attribute. Whenever the environment
plays only a marginal role in the specification of
the system, however, the external processes and
the links that connect them to the system may be
declared to be undefined.


Processes at one level of the specification 
may represent abstractions of entire nets to be 
identified later. DSDL allows one to state this 
fact through the use of the attribute
"REFINEMENT.OF" as in the example below:

NET netname; REFINEMENT.OF pname. 


END-NET netname. 
where the net "netname" is identified to be a 
refinement of the process "pname". Furthermore, 
any entity in the net may be declared to be a 
refinement of some entity in the process as long 
as consistency is preserved. 


Links. The last two elements in the net 
definition are the set of links L and the 
communications protocol C. The links represent 
the available paths of communication within the 
net. For each link l a set C(l) of allowable
sequences of send and receive type events in the 
processes that it connects is defined. This set 
describes the communications protocol on the link, 
and is referred to as the link behavior. The link 
behavior is specified in the same way as the 
process behavior. 


In DSDL, each link is defined by a unique 
name, the processes connected by it, and a 
description of its behavior given in the section 
on communication. More than two processes may 
have access to the same link and the same two 
processes may have more than one link in common. 
The motivation for this approach is to be found in 
the desire to enable the description of arbitrary 
interconnection structures. Furthermore, a link
may be later refined as a net that implements the
behavior of the link. A packet switching net, for 
instance, may be described first as a link between 
all the nodes it services and may be subsequently 
refined to include the switching nodes and their 
protocols. 


For every link in the net, a behavior 
description has to be included in the 
communication definition section. The link 
behavior defines the communication protocol 
associated with the link, i.e., the semantics of 
the GET and SEND commands. In defining the link 
behavior, the designer may use the same means of
specification as in the description of a process
behavior except that the set of events that may be
involved is restricted to the invocation of
receiving procedures, the invocation of sending
procedures, and the resumption of processing after
the invocation of sending or receiving procedures
in processes associated with the link. The
resumption of an event sequence in a process is 
specified by the event "pname:GO" in the link 
behavior. 
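
A link behavior can be read as a language of allowable event sequences, so a trace of events either conforms to it or does not. The sketch below illustrates this for the `channel` link of the producer/consumer net in the Example section, whose behavior is a loop of SEND, then GET, then the two GO events. It is a hand-written checker, not a DSDL implementation, and it assumes that the `//` operator in that example lets the two GO events occur in either order.

```python
SEND, GET = "producer:SEND", "consumer:GET"
GO_P, GO_C = "producer:GO", "consumer:GO"


def conforms(trace):
    """Return True if `trace` is a prefix of a sequence allowed on the channel.

    Assumed behavior: LOOP { SEND; GET; { producer:GO // consumer:GO } },
    where the two GO events may occur in either order.
    """
    step = 0                  # 0: expect SEND, 1: expect GET, 2: expect GO events
    remaining_go = set()
    for event in trace:
        if step == 0 and event == SEND:
            step = 1
        elif step == 1 and event == GET:
            step, remaining_go = 2, {GO_P, GO_C}
        elif step == 2 and event in remaining_go:
            remaining_go.discard(event)
            if not remaining_go:
                step = 0      # both processes resumed; the next cycle may start
        else:
            return False
    return True


ok = conforms([SEND, GET, GO_C, GO_P, SEND])
```

Under this reading, the producer cannot issue a second SEND before the consumer has performed its GET and both processes have resumed.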


Conclusions 


DSDL has been exercised on several small 
problems. These exercises were useful in 
demonstrating the language's power of expression. 
Nevertheless, there are many unresolved issues. 
First of all, a meaningful evaluation of the 
language demands its use on a real-life project of 
adequate complexity. Second, future advances in 
the study of the formal aspect of the language 
must be carried out in preparation for potential 
incorporation of DSDL in a computer-aided design 
system. As of now, a definition for consistency 
between levels has been proposed but proof
strategies need to be developed. Finally, a
complete system specification language ought to 
include the capability to define processors and 
their characteristics, the rules for allocating 
processes among processors, and performance 
specifications. Research in these areas is 
currently under way and the results will be 
reported elsewhere.

Example


NET consumer_producer.


PROCESS producer.
    DATA.
        VAR count: INTEGER; count > 0;
    INITIALIZATION.
        count := 1;
    PROCEDURE w := prepare.
        IN: count = n AND n > 0;
        OUT: w' = (n,t) AND CHARSTRING(t) AND
             count' = n + 1;
        EXPT: NIL;
    BEHAVIOR.
        LOOP {w := prepare;
              SEND w TO consumer ON channel; }
END-PROCESS producer.


PROCESS consumer.
        VAR count: INTEGER; count > 0;
    INITIALIZATION.
        UNDEFINED.
    PROCEDURE use(w);
        IN: w = (n, t) AND n MEMBER.OF INTEGER
            AND n > 0 AND CHARSTRING(t);
        OUT: count' = n;
        EXPT: NIL;
    BEHAVIOR.
        LOOP {GET(w) FROM producer ON channel;
              use(w);}
END-PROCESS consumer.


LINKS.
    channel: (producer, consumer);


COMMUNICATION.
    channel :
        LOOP {
            producer:SEND(z) TO consumer;
            consumer:GET(z) FROM producer;
            {producer:GO; // consumer:GO;} }


END-NET producer_consumer.

Abstract


The design of a VLSI compatible,
reconfigurable, partitionable multicomputer array
is presented in this paper. The resources of the
system can be divided into arbitrary size
rectangular partitions, and various computational
structures can be configured on each partition by
a distributed configuration process.

A scheme to provide each partition with a
private bus is introduced. This bus is essential
for instruction broadcast and also augments the
communication capability of each partition.

Design of a special purpose hardware unit
responsible for the process of partition
allocation is described. This unit assures that
the allocation process will not degrade the entire
system performance.


I. Introduction 
Due to the recent advances in VLSI
technology, parallel processing systems which
consist of thousands of processors have become
feasible [1]. To reduce design turnaround time
for such enormous systems and to achieve high
densities in VLSI, simple and regular
interconnection schemes are highly desirable [2].
Out of the many multicomputer interconnection
structures that have been proposed, the mesh type
networks such as those employed in ILLIAC IV [3]
and systolic arrays [2] are especially appealing
because they possess the properties of simplicity,
regularity, modularity, linear cost growth, and
ease of routing. All these merits make the mesh
type networks suitable for VLSI implementation of
large scale computer systems.


In this paper, the system design of the Mesh
Organized Partitionable Array Computer (MOPAC), a
partitionable SIMD/MIMD (PSM) [4] multicomputer
system with an architecture based on the two
dimensional mesh type interconnection network, is
described. Some of the features of MOPAC are:

- The processors in the mesh network can be
  partitioned into rectangular submeshes
  (partitions). Each user's job can be
  executed on one or more of these partitions.

- Each partition can work in either the SIMD or
  the MIMD mode.

- A single user's partitions can communicate
  with one another, and different users'
  partitions are isolated from one another.

- The size of a rectangular partition can be
  arbitrary. There is no limitation on the
  size of the smallest partition that needs to
  be allocated; therefore high resource
  utilization is possible.

- The processors in each partition can be
  configured into various computational
  structures. Distributed algorithms are
  designed to carry out the configuration
  process.

This research was supported in part by the
National Science Foundation Grant MCS-8116099.

Since each partition may have to operate in
SIMD mode, it must be provided with a
communication medium for the broadcasting and
receiving of common instructions and operands.
Moreover, these media among different partitions
should not interfere with one another. In MOPAC,
a scheme called the Partitionable Bussing System
has been developed to equip each partition with a
private bus. In addition to being used for
broadcasting, this bus also allows nonadjacent
processors in a partition to exchange information
directly, and therefore enhances the general
communication and synchronization capability of
the partition.


One of the most time consuming operations for
the control unit of MOPAC is the allocation of
partitions from the mesh network according to each
job's demand. The speed of this process
profoundly affects the entire system performance.
Owing to falling hardware costs and speed
improvements, it has become natural for system
designers to incorporate many of the repetitious
operating system functions into hardware modules.
In MOPAC, a special purpose hardware mechanism
called the Partition Allocation Unit is designed
to administer the allocation process. This unit
reduces the overhead incurred by allocation to a
minimum.

In Section II the overall organization and
operation of MOPAC are described. Section III
presents the Partitionable Bussing System, and in
Section IV the operation of the Partition
Allocation Unit is discussed. The conclusion
follows in Section V.

II. System Organization and Operation

Each processing element (PE) in MOPAC
consists of three components: an application
processor (AP), a memory unit, and a communication
processor (CP). The application processor is
responsible for the execution of users' programs,
which are stored in the memory unit along with
data. The communication processor is responsible
for communicating with other PE's and the Host.
All communication processors in the system are
organized as an n×n two dimensional mesh, with
corresponding pairs of edge processors connected
to form a topology equivalent to a torus
(Fig.1).

Each communication processor is also
connected to segments of the Partitionable Bussing
System (PBS). When a particular structure on a
partition is formed, the PBS bus segments of the
individual PE's in this structure will be tied
together as a single bus.

The Host is the unit which coordinates the
activities at the system level. Its
responsibilities include user program development,
input and output handling, determining the proper
size of the partition(s) required for a user's
job, allocation of the partitions from the mesh
network, and generation of the initial structure
configuration messages.

All CP's and the Host are connected to the
System Bus (SB). It is through this bus that the
Host can communicate with each PE when required.
The principal uses of the System Bus are:

- After a partition is formed, the Host will
  use SB to transfer initial configuration
  messages to a proper cell (called the Initial
  Cell) in the partition to start the structure
  configuration process.

- When a job in a partition finishes
  execution, the Initial Cell will report this
  status to the Host through SB.

- If a user's job has two or more partitions
  executed in parallel, these partitions can
  exchange information through SB.

- For fault diagnosis and fault tolerance
  purposes, x copies of a single job may run on
  x partitions simultaneously. The partitions
  may use the SB to communicate with each other
  and verify the results.


The partitioning philosophy of MOPAC is as
follows. For a large class of numerical problems,
image processing problems [5], and systolic
algorithms [2], the two dimensional mesh is one of
the most appropriate interconnection structures.
It is also observed [6][7] that many other SIMD or
MIMD types of computation structures, such as
linear array, pipeline, binary tree, and ring, can
be embedded in a rectangular mesh. In addition, to
partition a mesh network, the simplest and most
natural way is to partition it into sub-meshes.
Considering the above, MOPAC will allocate
rectangular partitions of the mesh for each user's
job.


The allocation process of MOPAC is outlined
below. After the Host has compiled a user's
program and determined the dimensions of the
rectangular partition(s) needed [7], the Host will
have to allocate the partitions from the available
PE's on the mesh. Since there might be other
users' jobs (i.e. other partitions) which occupy
part or all of the PE's in the mesh network
(Fig.2), it becomes quite difficult for the Host
to search for a partition of the required size.
This searching process is aided by the Partition
Allocation Unit (PAU). If PAU cannot find the
desired partition, the job will have to wait in a
queue, until some of the jobs on the mesh finish
processing and release their partitions. If the
partition is found, the configuration process for
the particular job structure may start.


The organization of the communication
processor and the distributed configuration
process are similar to those given in [6].
There are five registers defined in the
communication processor:

ID Register: Contains the address (row and column
indices) of the cell.

Mode Register (MR): Indicates the mode of
operation of each cell. Each cell can be either a
processing cell (PC) participating in the
processing of a given structure or a connecting
cell (CC) whose role is to transfer data in
different directions without performing any kind
of processing.

Configuration Register (CR): Contains information
about the present structure, and the logical address
of the cell within the structure.

Predecessor Register: A four bit register which
stores the directions of the predecessor cells in
each structure. Each bit corresponds to an
element of {W,S,N,E}, the four directions on a
compass.

Successor Register: Similar to the Predecessor
Register except it stores the directions of the
successor cells in each structure.


The configuration process starts when the
Host sends a configuration message to the Initial
Cell in the partition. The general form of the
configuration message is:

M(structure type, size of the structure, size of
the partition p and q, level within a structure,
direction)


The structure type is the code name for the
current structure that is being configured; for
example, LA stands for linear array. Size of the
structure indicates the size of the current
structure. For example, if the configuration is a
linear array with k elements, size of the
structure will be k. Size of the partition p and
q indicates the row and column width of the
current partition allocated to the job.
This part of the message is vital because, by
knowing this information, the CP's will not try to
construct a structure outside their own partition,
thereby preserving independence among different
partitions. Level within a structure indicates
the level within a structure that the next
processing cell should assume. For example, the
first element in a linear array structure usually
assumes level 0, the second element level 1, etc.
This level number may also be viewed as the
logical address of the cell in a structure.
Direction d is a parameter which directs the
transmission direction of future configuration
messages. One of the functions defined on d is
op(d), which is the 180-degree opposite direction
to d. For instance op(S)=N, op(E)=W, etc.


After the Initial Cell receives the initial
configuration message, it will examine the
contents of the message, modify the values of some
of the parameters, and pass the message to the proper
neighboring cells in a similar format. The
neighbors will repeat the same procedure, until
the whole structure is configured on the
partition. Since the configuration is done in
such a distributed manner, the burden that would
have been put on the Host and System Bus if the
configuration were done in a centralized way is
eliminated.


For illustration, the distributed algorithm
to configure a linear array of size k on a p×q
(pq≥k) partition is shown below. The position of
the Initial Cell is at the northwestern corner of
the partition, and the initial configuration
message generated by the Host is M(LA,k,q,0,q,E).


For each cell:
Receive M(LA,k,q,l,c,d) from predecessor;
Begin
  CR := LA,l;   (*configure to the lth element
                  of the linear array*)
  if l<>k-1
  then (*configuration process not
         complete*)
    if c=1
    then (*come to the edge of
           partition*)
      transmit M(LA,k,q,l+1,q,op(d)) to
        S-neighbor
    else
      transmit M(LA,k,q,l+1,c-1,d) to
        d-neighbor
  else (*l=k-1*)
    configuration process complete
End.

The message transmission pattern is shown in
Fig.3.

The c parameter in the message is used as a
counter to detect if the message has reached a CP
on the edge of the partition. At the beginning of
each row in the transmission path, c is set to q,
the column width of the partition. As the message
passes from CP to CP, c is decremented by 1
each time, until the processor at the edge of
the partition sees c=1. At this point, the edge
processor will reset the value of c to q, reverse
the direction parameter, and pass the message to
the S-neighbor.
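
The behavior of the message parameters l, c, and d can be traced with a small sequential simulation. The sketch below is illustrative only: it follows a single configuration message through a p×q partition, using local coordinates with the Initial Cell at (0, 0), and records the level assigned to each cell and the direction in which each cell forwards the message. The function name and the returned dictionaries are assumptions made for the example, and the real process is distributed across the CP's rather than run in one loop.

```python
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}          # op(d)
STEP = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}


def configure_linear_array(k, p, q):
    """Trace M(LA, k, q, l, c, d) through a p x q partition.

    Returns (levels, successor): levels maps (row, col) to the level l stored
    in the cell's CR; successor maps (row, col) to the direction in which the
    cell forwarded the message.
    """
    assert 1 <= k <= p * q
    levels, successor = {}, {}
    row, col = 0, 0
    l, c, d = 0, q, "E"                  # initial message M(LA, k, q, 0, q, E)
    while True:
        levels[(row, col)] = l           # CR := LA, l
        if l == k - 1:                   # configuration process complete
            break
        if c == 1:                       # edge of the partition reached
            out, l, c, d = "S", l + 1, q, OPPOSITE[d]
        else:
            out, l, c = d, l + 1, c - 1
        successor[(row, col)] = out
        dr, dc = STEP[out]
        row, col = row + dr, col + dc
    return levels, successor


levels, successor = configure_linear_array(6, 2, 3)
```

For k = 6 on a 2×3 partition, the message travels east along the first row, drops to the second row at the eastern edge, and returns westward, so the array snakes through the partition without leaving it.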


Configuration algorithms for structures such
as square array, ring, and tree can be found in
[6][7].


The private bus for the partition is also 
formed during the configuration process (Section 
III). When the process is finished, the last CP
will broadcast a "configuration complete" signal to
other CP's in the partition through the private 
bus, and the execution of the job can begin. 


III. The Partitionable Bussing System 


As shown in Fig.4, each CP has four PBS bus
segments. For CP(i,j), the E-segment is connected
to the W-segment of CP(i,j+1), the N-segment is
connected to the S-segment of CP(i-1,j), etc.
Intervening between a pair of segments is a
switch, whose state will determine if the two
segments are connected.


The state of the switch is controlled by the
setting of the bits in the Successor Register.
When the configuration message for a particular
structure is transferred from a cell to its
d-neighbor (d ∈ {W,S,N,E}), the d-bit in the
Successor Register of this cell is set. Since
this bit also controls the state of the
corresponding switch, the bus segments between
this cell and the d-neighbor can now communicate
via the turned-on switch.


As the configuration messages are being 
transferred from CP to CP, the switches between 
predecessor cells and successor cells are turned 
on accordingly. After the configuration process 
is completed, the bus segments belonging to the 
cells of the same structure are all connected, and 
together they serve as one single bus for the 
partition. Since the configuration message for a 
job is designed to run in its own partition only, 
switches among CP's in different partitions will 
not be activated, so the bussing systems of 
different partitions are isolated from one 
another. 


After a job finishes execution, each CP in 
the partition will clear its individual Successor 
Register, and the PBS segments are disconnected 
from one another. The released resources in this 
partition are now ready to participate in new 
partitions. 


Providing an additional bus for the 
computational structure configured on a partition 
augments the communication capability of the 
structure. It was shown in [8] how the complexity 
of finding the maximum element on a square array 
structure is reduced by using the global bus. 
Another example is the binary tree structure. 
Though it is expected that most communications 
would occur locally between parent and child 
nodes, for the occasional situation in which remote 
leaf nodes need to communicate, the global bus can 
be used without having to relay the messages up 
and down the entire tree. 


IV. The Partition Allocation Unit 

The core of the PAU (Fig. 5) is an associative 
memory, in which the search for a particular bit 
pattern is conducted by all words in parallel. 
The bit pattern is stored in the Comparison 
Register. Bits that do not participate in a 
certain search can be masked out by setting the 
corresponding bits in the Mask Register to 
logic 0. If the search is successful in a memory 
word, the corresponding bit in the Match Register 
is set to 1; otherwise it is set to 0. 


The size of the associative memory (AM) is 
nxn, the same as that of the mesh network. The 
content of each cell in AM is set to 1 if the 
corresponding processor in the mesh is free, or to 
0 if the processor is busy. After each allocation 
for a new partition, and after the completion of 
the job on an existing partition, the state of the 
corresponding cells in AM is updated 
accordingly. All faulty processors are also 
marked as busy. In other words, the AM serves as 
a hardwired bit map which keeps the status of all 
processors in the mesh network. 

For the purpose of PAU operation, all the 
bits in the Comparison Register are set to 1, and 
only the Mask Register (which is connected as 
a circular shift register) is used to control the 
searching process. When a job demands a 
rectangular partition of size pxq, the Host 
issues a search command to the control unit of the 
PAU. The search process starts by setting the 
first q bits in the Mask Register to 1 and 
initiating the AM operation. All words in AM 
examine in parallel whether their first q bits are 
1, and if so, the corresponding flags in the Match 
Register are set. 
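
A minimal Python sketch of one masked search follows. The list-of-lists representation of the AM and the function name `match_register` are assumptions for illustration; the hardware examines all words at once, whereas the sketch simply loops over them.

```python
def match_register(am, mask):
    """Return the Match Register for one masked search.

    am   : n words of n bits each, 1 = processor free, 0 = busy
    mask : n bits, 1 = this column participates in the search
    The Comparison Register is all 1's, so word r matches when every
    participating bit of that word is 1.
    """
    return [
        int(all(bit == 1 for bit, m in zip(word, mask) if m))
        for word in am
    ]
```

With the first q=2 mask bits set, a word matches exactly when its first two cells are free.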


The result of the above process actually 
tells the availability of any partition with a 
column width of q in the first q columns of the 
mesh. The remaining task of selecting the 
partition with row width p lies in examining the 
contents of the Match Register. If there are p 
consecutive bits in the Match Register that are 
set, a partition of the desired size is found. 
The AND Network following the Match Register is 
designed to provide this function. 


There are n stages in the AND Network, and 
each stage has n outputs. Let the kth bit of the 
Match Register be denoted by MR(k), and the output 
of the jth AND gate in stage i (i∈[1,n], j∈[0,n-1]) 
be denoted by OUT(i,j), then 

$$\mathrm{OUT}(i,j) = \bigcap_{k=j}^{j+i-1 \pmod{n}} \mathrm{MR}(k)$$

where ∩ denotes the AND operation. 


In words, the state of OUT(i,j) tells whether 
there are i consecutive 1's starting with the jth 
bit in the Match Register. If the desired 
dimension of the partition is pxq, then the function 

$$\mathrm{FOUND} = \bigcup_{j=0}^{n-1} \mathrm{OUT}(p,j)$$

where ∪ denotes the OR operation, reveals whether 
at least one such partition exists. 
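
The two functions can be written out as a short Python sketch. The names `and_network_out` and `found` are illustrative. The index is taken modulo n, following the upper limit of the AND expression, so a run of 1's may wrap from the last bit of the Match Register to the first.

```python
def and_network_out(mr, i, j):
    """OUT(i,j): 1 if there are i consecutive 1's starting at bit j of MR.

    Indices are taken modulo n, as in the upper limit j+i-1 (mod n).
    """
    n = len(mr)
    return int(all(mr[(j + t) % n] for t in range(i)))


def found(mr, p):
    """FOUND: OR over j of OUT(p,j)."""
    return int(any(and_network_out(mr, p, j) for j in range(len(mr))))
```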


To know if a partition exists is not enough. 
If it does, the Host must also know the location 
of the partition. This is the function of the 
priority encoder shown below the AND Network in 
Fig. 5. The connection between the outputs of each 
stage of the AND Network and the inputs to the 
priority encoder is controlled by a set of n 
OUT-CONTROL (OC) signals. When OC(p) is set, only 
the outputs from stage p, i.e. OUT(p,j), 0≤j≤n-1, 
are fed into the priority encoder. If there 
is only one OUT(p,j) true, j appears at the 
output of the priority encoder, which is marked as 
ROW-ADDRESS in Fig. 5. If there is more than one 
j such that OUT(p,j) is true, then only the 
smallest such j is selected as ROW-ADDRESS 
by the priority encoder. 


The above procedure constitutes one cycle of 
operation of the PAU. If, at the end of the first 
cycle, the control unit sees that FOUND is false, 
no partition of size pxq exists in the first q 
columns of the mesh. To determine whether such a 
partition exists in the next q columns, the 
control unit shifts the Mask Register right by 
1 bit, and a second search cycle begins. This 
process continues until FOUND turns out to be true 
in some cycle, or until n cycles have passed with 
FOUND false in every cycle, in which case the 
control unit can tell that no partition of size 
pxq exists. If FOUND is true in cycle c 
(c∈[0,n-1]), the row address of the cell in the 
northwestern corner of the partition is shown at 
the output of the priority encoder, ROW-ADDRESS. 
The column address is c. 
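
The complete search can be summarized in a software sketch, given here in Python under stated assumptions: the AM is a list of n words of n bits, the function name `find_partition` is hypothetical, and the loops stand in for what the hardware does in parallel within each cycle. Because the Mask Register shifts circularly and the AND Network indexes modulo n, the sketch (like the formulas as written) can report a partition that wraps around the mesh boundary.

```python
def find_partition(am, p, q):
    """Return (row, col) of the northwestern corner of a free p x q
    partition, or None if no cycle reports FOUND.

    am: n words of n bits each, 1 = free, 0 = busy.
    """
    n = len(am)
    mask = [1] * q + [0] * (n - q)          # first q mask bits set
    for c in range(n):                      # at most n search cycles
        # Associative search: word matches if all unmasked bits are 1.
        mr = [int(all(b for b, m in zip(word, mask) if m)) for word in am]
        # Stage p of the AND Network: p consecutive 1's starting at bit j.
        hits = [j for j in range(n)
                if all(mr[(j + t) % n] for t in range(p))]
        if hits:                            # FOUND is true in cycle c
            return hits[0], c               # priority encoder: smallest j
        mask = mask[-1:] + mask[:-1]        # circular shift right by 1 bit
    return None
```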


The PAU is amenable to implementation on a 
special purpose VLSI chip. The structures of the 
associative memory and the AND network are 
regular, so the design of the elementary cells 
and their layout are straightforward. In 
addition, the circuit complexity of the PAU is 
proportional to N, the number of PE's in the mesh, 
while the number of pins required is proportional 
to log N. This high ratio of circuit complexity to 
pin count makes the PAU ideal for single chip 
implementation. 

V. Conclusion 


As the design cost and design time seem 
to be two major obstacles in the development of 
large scale VLSI systems, a highly regular and 
reconfigurable architecture such as the MOPAC 
system becomes especially valuable. Since MOPAC 
is regular, the initial design man-year effort is 
significantly reduced. Since MOPAC is 
reconfigurable, it can provide a wide range of 
capabilities, and the custom design cost for 
individual applications is eliminated. In 
addition, the independent bus within each 
partition combined with direct interprocessor 
connections provides an excellent environment for 
jobs exhibiting strong locality of shared 
resources. Thus it is concluded that MOPAC can be 
truly considered as one of the viable VLSI 
architectures. 


The Multiprocessor EMPRESS 
A Useful Tool for Studying Parallelization Concepts 


Hans-Joerg Brundiers, 


Richard E. 
Hansmartin Friess and Milan Tadian, 
Swiss Federal Institute of Technology, 


CH-8092 Zurich, 


Abstract. This study describes preliminary 
results of general interest based on the first 
utilization of the multiprocessor EMPRESS of 
the ETH Zurich. To carry out the study, an 
electrical-engineering reference problem was 
solved in the parallel mode using the 
parallelized version of the program PSCSP 
(which numerically simulates continuous 
systems using the integration method of Taylor 
series). 


Results are presented for the following 
asynchronous parallelization concepts: 

- single-stage parallelization 

- double-stage parallelization 

- improved single-stage parallelization using 
a new parallel algorithm for Taylor series 
and a partially distributed control of the 
parallel processes, 

- double-stage parallelization, simulation of 
optimal hardware by scaling the instruction 
timing. 

Two further parallelization methods are 

proposed to be tested on EMPRESS: 

- nearly completely distributed control of 
parallel processes and 

- complete prescheduling and synchronous 


operation. 


Introduction 
Multiprocessor Hardware 


The multiprocessor EMPRESS is a model computer of 
the MIMD type assembled at the ETH Zurich. The 
system comprises 16 LSI-11 processors (called 
execute processors=EP) and one PDP-11/34 (called 
supervisor). The newly developed communication 
hardware consists of two networks (for a detailed 
description see [1]): 

- the intercommunication memory 'intercom' is 
organized as a quadratic matrix of 17x17 RAM 
segments. It allows simultaneous and delayless 
exchange of data. 

- in the process distribution system, a hard-wired 
'job controller' monitors the 16 EP's via a fast 
parallel bus. 


The hardware supports two stages of 
parallelization. In the first stage the supervisor 
distributes processes among the EP's. In the 
second stage a 'master' EP can form a logically 
contiguous group together with other 'slave' EP's 
and distribute sub-processes among the latter. 


Application Software 


As first application, a parallel algorithm using 
Taylor series for the numerical integration of 
differential equations [2] was tested. The basic 
concept of this method is the piecewise 
approximation of the solutions by expansion into 
Taylor series. Higher terms of the series are 
calculated recursively from previously calculated 
terms. The algorithm provides a decomposition of 
arithmetic expressions within the differential 
equations into single preprogrammed recursion 
formulae. 

Parallelism is exploited on the level of 
arithmetic expressions as well as on the level of 
recursion formulae. For nonlinear recursions the 
formulae typically contain summations of the 
following form (where the inherent parallelism 
depends on the order n): 


$$\sum_{m=k}^{m(n)} (\,\cdot\,)$$


The algorithm is realized (in sequential form) in 
the program PSCSP [3] for the numerical simulation 
of continuous systems developed at the ETH. A 
restricted version of this package has been 
adapted to the multiprocessor. Its main parts are 
mentioned briefly (see also [1]): 

- The separate precompiler accepts the user- 
provided definitions of the differential 
equations and produces a Fortran program 
containing calls to recursion formulae. 

- The code generator does a symbolic execution of 
this program accounting for run time values of 
parameters. Thus it generates an optimized 
branchless code sequence for the calls. 

- The scheduling routine establishes a tabular 
process dependency graph corresponding to the 
code sequence. 

- The process controller guides the asynchronous 
parallel execution of the processes by 
interpreting the dependency graph. (These parts 
run in the supervisor.) 

- The library of preprogrammed recursion formulae 
to run in the EP's exists in 2 versions: 
  - evaluation of a formula is one process in one 
    EP 
  - formulae are sub-parallelized; sub-processes 
    run simultaneously on various processors. 

This hardware and software system offers several 
possibilities for performance measurements and for 
testing other parallelization methods. 

The next paragraph introduces the reference problem 
used for performance tests, followed by the results 
obtained using the basic system. Section IV 
describes results obtained using improved 
software. In section V we present results 
obtained by simulating optimal hardware. In the 
final section we mention two other parallelization 
concepts to be tested on EMPRESS. 

II Reference Problem 


As reference problem, we chose the following 
system of differential equations describing the 
dynamics of an electrical synchronous machine [4]. 


$$\dot{I} = L^{-1} \cdot V(I, \delta, s),$$

) = 0 ae 
I designates a vector of 5 currents, L a 5x5 
coefficient matrix, V a vector of 5 voltage 
functions, s the load angle and δ the slip. 


This problem has an adequate degree of parallelism 
(seven integrators), nonlinearity (which means 
subparallelism in recursion formulae) and coupling 
(which means complexity in the dependency graph). 
Fig. 1 shows the dependency graph pertaining to 
this problem and a certain set of parameters. The 
numbers correspond to processes (=evaluation of 
recursion formulae) to be executed in the EP's. 
The bold typed numbers indicate recursive processes 
whose internal parallelism and time behaviour 
depend on n. To get the n'th term in the series 
expansion, this graph has to be processed for the 
order n, where n goes from 0 to 15. 


Fig. 1 Dependency Graph 


III Basic System 


The operation of the multiprocessor in the single- 
stage mode is illustrated in the process flow 
diagram in fig.2a, for a pass through the graph 
with n=9. At the beginning, the supervisor signals 
to the job controller that the 'root processes', 
like 19, 7, ... are ready to be started. After 
finishing a process, the EP sends a process 
identification to the supervisor, which in turn 
transfers the results to the common data region of 
the intercom and checks whether any follower 
process is ready to be started, until the last 
'leaf process' terminates. 
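
The control flow just described can be sketched in Python. This is a sequential illustration only, not the supervisor's process controller: the dictionary representation of the dependency table and the function name `start_order` are assumptions, and the edges in any example graph are hypothetical rather than read from Fig. 1.

```python
from collections import deque


def start_order(followers):
    """Return one order in which processes become startable.

    followers maps each process number to the list of its follower
    processes. A process is ready once all its predecessors finished.
    """
    pending = {p: 0 for p in followers}     # unfinished predecessors
    for fs in followers.values():
        for f in fs:
            pending[f] = pending.get(f, 0) + 1
    # Root processes have no predecessors and are ready at the beginning.
    ready = deque(sorted(p for p, k in pending.items() if k == 0))
    order = []
    while ready:
        p = ready.popleft()                 # an EP runs and finishes p
        order.append(p)
        for f in followers.get(p, []):      # check the followers of p
            pending[f] -= 1
            if pending[f] == 0:
                ready.append(f)             # follower is ready to start
    return order
```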

In this mode the recursive processes lead to a 
delay, as expected. Further delays (e.g. 4 which 
follows 3) are due to overloads in the supervisor 
process controller. The mean speed-up is 3.4. The 
graph shows a mean parallelism of 5.6, defined as 
(total number of processes)/(number of processes on 
critical path). 
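
The definition of mean parallelism can be evaluated mechanically for any dependency graph. The Python sketch below assumes that the critical path is the chain containing the largest number of processes, which ignores differing process durations; the function name `mean_parallelism` and the example graph are illustrative and not taken from Fig. 1.

```python
from functools import lru_cache


def mean_parallelism(followers):
    """(total number of processes) / (number of processes on critical path).

    followers maps each process to the list of its follower processes.
    The critical path is taken as the longest chain, counted in processes.
    """
    nodes = set(followers) | {f for fs in followers.values() for f in fs}

    @lru_cache(maxsize=None)
    def longest_from(p):
        # Number of processes on the longest chain starting at p.
        return 1 + max((longest_from(f) for f in followers.get(p, ())),
                       default=0)

    critical = max(longest_from(p) for p in nodes)
    return len(nodes) / critical
```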


Fig.2c shows how the system works in the double- 
stage mode. The recursive processes are now split 
into a master process and several slave processes. 
There is only a small gain in the mean speed-up, 
due to overhead in the administration of sub- 
processes and a non-optimal choice of the number of 
sub-processes. 


IV Improved Software 


In ref.[5] another parallel algorithm for the 
recursive calculation of Taylor series is proposed. 
This algorithm has been installed in the basic 
system: Let j_n be a recursive process in the graph, 
like 14, used to calculate the following sum: 

$$R_n = \sum_{k=0}^{n} P_{n-k} \cdot Q_k$$

This sum is no longer split into sub-processes at 
the time j_n is processed, but two new processes 
j''_{n-1} and j'_{n-1} are prescheduled to do the 
partial sums depending on terms of the order n-2 
and n-1 respectively. When processing the graph 
for n-1, j'' is started as a root process, and j' 
as follower of j. j'' and j' are no longer on the 
critical path, as shown in fig.2b (for j=52 j'' and 
j' could be taken together because j finished 
before j''). 


A process control routine was installed in the 
EP's in order to partially distribute the process 
control load. This routine has direct read-only 
access to the dependency table stored in the 
common data region. Having completed a process, 
the EP undertakes the next awaiting follower 
process, if any. The supervisor remains engaged 
in communicating data to other followers and 
starting them. Fig.2b demonstrates the effect of 
both improvements. 


V Simulation of Optimal Hardware 


In ref.[6] hardware improvements for the 
multiprocessor are proposed including: 

- EP's with registers and instructions dedicated 
to the process distribution hardware 

- intercom with access time same as for local 
memory and automatic wait before reading results 
from other processes 

- faster process distribution bus 

- dedicated process distribution hardware in the 
supervisor. 

Such hardware was simulated in the basic system 
by delaying instructions for process control and 
execution in the supervisor and EP's in such a way 
that non-optimizable instructions are delayed by a 
factor of 10 and others by an estimated value <10. 
Fig.2d indicates a significantly higher speed-up. 


VI Conclusions 


The speed-up can be summarized as follows: 


1-stage parallelization, basic 2.9 3.9 3.4 
1-stage parallelization, improved 3.7 Ye 5.4 
2-stage parallelization, basic 2.4 4.8 3.5 
2-stage optimal hardware (x10) 5.4 8.6 7.1 


Encouraged by the improvements obtained so far, we 

plan to test two more parallelization methods: 

- Using the 2nd stage hardware, process control 
can be distributed completely: if an EP finishes 
a process with more than one follower, the EP 
can assign remaining followers to slave EP's 
(except in deadlock situations). 

- It is possible to resolve a sequence of 
recursion formulae (as well as other 
expressions) into operations of similar length, 
which can be prescheduled by assigning a logical 
processor number and time slot to each 
operation [5]. Such a problem can run 
synchronously on EMPRESS: One master EP 
synchronizes the remaining slave EP's in 
processing the parallel operations. 

A future object in view would be a 'parallel 
operating system' for EMPRESS supporting parallel 
processing of more general applications. 


PERFORMANCE OF A MODULAR INTERACTIVE DATA ANALYSIS SYSTEM (MIDAS)* 


Creve Maples, Daniel Weaver, Douglas Logan, and William Rathbun 
Lawrence Berkeley Laboratory, University of California 
Berkeley, California 94720 


Abstract -- A processor cluster, part of a multi- 
processor system named MIDAS (Modular Interactive 
Data Analysis System), has been constructed and 
tested. The architecture permits considerable 
flexibility in organizing the processing elements for 
different applications. The current tests involved 8 
general CPUs from commercial computers, 2 special 
purpose pipelined processors and a specially designed 
communications system. Results on a variety of 
programs indicated that the cluster performs from 8 
to 16 times faster than a standard computer with an 
identical CPU. The range represents the effect of 
differing CPU and I/O requirements - ranging from 
CPU intensive to I/O intensive. A benchmark test 
indicated that the cluster performed at approximately 
85% the speed of the CDC 7600. Plans for further 
cluster enhancements and multi-cluster operation are 
discussed. 


Background and Objective 


MIDAS, a Modular Interactive Data Analysis 
System being developed at the University of 
California Lawrence Berkeley Laboratory, is based on 
the concurrent operation of multiple asynchronous 
processors. The architecture is designed to provide a 
highly-interactive, graphics-oriented, multi-user envi- 
ronment that permits programs to dynamically utilize 
multiple processors in a manner appropriate for the 
calculation. The system was specifically oriented 
towards handling scientific calculations, particularly 
those involving data analysis. This criterion 
necessitates that the facility be able to accommodate 
data bases on the order of 200 to 3000 Mbytes per user. 


The basic project objectives were to achieve high 
processing speeds, optimized I/O handling, modular 
hardware and software structure, flexible architec- 
ture, and a fault-tolerant environment. Initially the 
system should be capable of achieving processing 
speeds approximately equivalent to that of a CDC 
7600 per problem. The design is expandable, 
however, and ultimately should be able to provide 
between 10 and 100 times this performance. In order 
to efficiently handle potentially large volumes of 
information, the architecture permits information 
transfers to be optimized to minimize the effects of 
such operations on calculations. A high degree of 
modularity is required to support system expansion, to 
allow future utilization of different CPUs and mass 
storage devices, and to facilitate a fault-tolerant 
structure. 


The effective utilization of multiple processing 
elements in a wide variety of applications requires a 
high degree of architectural flexibility. Programs 
must be permitted to organize and control processors 
and communication in a manner appropriate to the 


particular problem. Error recovery and human 
engineering are important concerns in any computer 
system and particularly so in a multiprocessor 
environment. The design criteria require that the 
system be able to identify and isolate failures at the 
module level and, in most cases, circumvent the 
failure by employing alternate modules (with, how- 
ever, a possible degradation in throughput). Similarly, 
diagnostic tools are required that allow users to 
easily debug code in a real-time environment of 
multiple asynchronous processors. 

*This work was supported by the Director, Office of 
Office of High Energy and Nuclear Physics and 
Nuclear Science of the Basic Energy Science 
Program, U.S. Dept. of Energy, Contract No. 
DE-A03-76SF 0098 


Architecture 


Essentially the MIDAS design organizes multiple 
processing elements into clusters. Each cluster 
combines a number of central processing units from 
commercial computers with independently developed 
specialized processors and a specially designed 
communications system [1,2,3]. The basic 
architectural structure is a three-level hierarchy of 
computer processors, organized in a general tree- 
structure and integrated with independent 
"intelligent" mass storage and interactive systems. 
The three processing levels consist of a Primary 
Computer, Secondary Computers, and Multiple 
Processor Arrays. 


A. Multi-Processor Arrays 


For the purpose of control, MIDAS organizes 
groups of processors into clusters called Multiple 
Processor Arrays. A simplified version of such an 
array, containing ten active processors, is shown in 
Figure 1. Two of these processors, the Input and 
Output Formatter, are specialized pipelined devices 
designed to handle information flow into and out of 
the cluster. They operate independently at a 200 ns 
clock cycle on two separate external 20 Mbytes/sec. 
I/O busses (32-bit data, 8-bit control). These 
processors may, depending on their programming, select 
or reject information (filtering); expand or compress 
data (format); manipulate data (mask, shift, etc.); or 
route specified information as required by other 
processors in the cluster. 


Figure 1. A single Multiple Processor Array 


The eight general purpose CPUs, illustrated in 
Figure 1, are referred to as Programmable Arithmetic 
Modules (PAMs). They are standard commercial 
CPUs with dedicated memory, able to handle 
scientific calculations in general, and floating point 
operations and Fortran codes in particular. For the 
initial development the ModComp 7870 CPUs were 
selected. These processors support 64-bit floating- 
point hardware, pipelined operation, and up to 4 
Mbytes of memory. The CPUs are, for comparison, 
roughly 15% slower than the DEC VAX 11/780. 


B. Communication 


As previously discussed, data flow into and out 
of a cluster is handled by two independent, 20 
Mbytes/second busses. Within a cluster, however, 
interprocessor communication may be handled either 
by controlling access to independent memory units or 
via global shared memory. Each of the sixteen 
independent memory modules, illustrated in Figure 1, 
has a dedicated memory bus and may contain up to 
256 KB of memory. A 5 x 16 cross bar switch 
allows any memory module to be dynamically 
attached to any of the five processor busses shown. 
Since information transfer between a memory module 
and a CPU (PAM) is considerably faster than the 
cycle time of the CPU, it was possible to time- 
multiplex 8 independent memory-CPU connections on 
the same bus with essentially no degradation of 
access time. Time-multiplexing these connections 
was an implementation, not an architectural, decision. 
Functionally the communication access operates as a 
12 x 16 cross bar switch. 


TABLE 1 
DISTRIBUTED SUBSYSTEM 
Functional Memory Classification 

MEMORY CLASS     TYPE           ACCESS   AVAILABILITY 
1. Dedicated     Program/Data   Fixed    Always 
2. Switchable    Program/Data   All      Request 
3. Shared        Program/Data   CPU's    Always - Queued 
4. Storage       Data           All      Always - Indirect 


Figure 2. A Distributed Subsystem. 


Any memory module may thus be attached to 
any processor at any time. Switching a memory 
module between available processors requires about 50 
ns. This use of bank-switched memory units for 
multiprocessor communication is similar to the S1 
Multiprocessor architecture under development at 
Lawrence Livermore Laboratory [4]. One major 
difference in implementation, however, is that MIDAS 
currently prohibits a memory module from being 
simultaneously accessed by more than one processor. 
When a processor is attached to a module, it has 
exclusive access to that memory until it relinquishes 
it (or until the supervisory CPU forces a relinquish). 
Each memory module is also equipped with auto- 
zeroing hardware and a current destination directory. 
Thus, when a module is released by a processor the 
directory pointer is incremented and the memory is 
attached to the next class of processor specified. If 
all processors of the specified class are busy, the 
memory will remain unattached until one becomes 
available. Once a processor-memory connection 
occurs, there is no functional distinction between the 
switched memory and the processor's local dedicated 
memory. The memory module is accessed by 
standard load and store instructions, rather than by 
I/O commands. Thus from a programmer's point of 
view, the switched memory is simply a particular 
common block. 
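
The exclusive-access and destination-directory behaviour can be modelled with a small Python sketch. The class `MemoryModule`, its method names, the processor class labels, and the wrap-around at the end of the directory are all assumptions made for illustration; the text does not specify what happens after the last directory entry, and the real switching is done in hardware by the Conductor.

```python
class MemoryModule:
    """Toy model of one switchable memory module."""

    def __init__(self, directory):
        self.directory = list(directory)  # processor classes, in order
        self.pointer = 0                  # current destination entry
        self.owner = None                 # processor currently attached

    def next_class(self):
        """Class of processor the module is waiting to be attached to."""
        return self.directory[self.pointer]

    def attach(self, processor, processor_class):
        # Only one processor may be attached to a module at a time.
        if self.owner is not None:
            raise RuntimeError("module is already attached")
        if processor_class != self.next_class():
            raise RuntimeError("module is destined for another class")
        self.owner = processor

    def release(self):
        # On release the directory pointer is incremented; wrapping
        # around at the end of the directory is an assumption here.
        self.owner = None
        self.pointer = (self.pointer + 1) % len(self.directory)
```

A module created with the directory `["input", "pam", "output"]` would pass from an input processor to a PAM and then to an output processor.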


General communication between processors may 
also occur by means of a global shared-memory unit. 
Access to this memory is given on the basis of a 
demand queue. For store operations, a processor may 
lock out other processors until all memory updates 
are completed. Since frequent accesses of the global 
shared memory could slow the parallel operation of 
the processors, its use should be minimized. Each 
processor may also communicate directly with the 
supervisory computer described in the next section. 


An independent bulk memory unit, with a 32 
Mbyte capacity, is also available for data storage. 
CPU (PAM) access to this unit is indirect in that 
information must be transferred via the switchable 
memory modules. This mode of accessing bulk mem- 
ory is quite efficient with respect to CPU utilization 
since a PAM continues operation immediately after 
releasing a memory module, and does not wait until 
the data transfer to bulk memory is complete. The 
bulk memory has dual ports which can be utilized 
either in a standard DMA transfer or in an address 
incrementing (+1) mode. Table 1 summarizes the 
classes of memory available within a cluster, and the 
attributes of each. 


C. Control 


Each multiprocessor cluster is controlled and 
monitored by a standard commercial mini-computer, 
called a Secondary CPU. For simplicity of operation, 
this computer should be compatible with the CPUs 
utilized in the Multiple Processor Array. This two- 
level structure, illustrated in Figure 2, is called a 
Distributed Subsystem, and forms the basic processing 
unit of the MIDAS structure. The function of a 
subsystem is to receive specific problems (from the 
Primary Computer, the highest control level in the 
system) and to break the problem into components 


(e.g., input, output, initialization, computational 
stages, etc.) which can be carried out in parallel. It 
then establishes the corresponding processor sequence 
and communication paths in the Multiple Processor 
Array, and loads the appropriate code into each 
processing element. During execution the Secondary 
monitors and, if necessary, controls the operation of 
the Multiple Processor Array, and maintains contact 
with its supervisor, the Primary Computer. 


In order to perform these operations the 
Secondary has its own dedicated disc drive, and runs 
an essentially standard operating system. It therefore 
can, for example, compile, assemble, and link pro- 
grams. The console-control functions of all the 
standard CPUs (in the MPA) are interfaced into the 
Secondary, which therefore has complete control and 
monitor capability. It can run, halt, resume, or 
single-step each PAM independently or collectively 
and can monitor or modify selected registers or 
memory locations. The Secondary is, however, too 
slow to directly control the high-speed, asynchronous 
memory switching potentially required in the MPA. 
The actual switching of memory modules and inter- 
processor communication is handled by a special 
hardware device termed the Conductor. The 
Conductor is functionally controlled by the Secondary 
and can supply the Secondary with detailed 
information on system activity when requested. 


The distributed subsystem has four communication 
channels to the external environment: the two I/O 
busses associated with the Multiple Processor Array; 
an independent bus attached to the Secondary; and a 
direct communication channel with the Primary Com- 
puter. The first three busses are under the control 
of the Primary and may be switched to whatever 
external devices (or processors) that are appropriate. 


The function of the Primary Computer is to 
handle all user interaction and to allocate and 
manage all the resources of the system. In this 
regard it oversees the operation of the multiple sub- 
systems and controls all intersubsystem connections. 
Further, at the request of users, it will supply, on a 
real-time basis, status information on interim results 
of executing problems. 


A fundamental objective for MIDAS was the 
support of a highly interactive computing environ- 
ment. The monitoring capabilities of the Secondary 
Computers were explicitly designed to support such 
operation. Within its cluster and without perturbing 
the actual calculations, a Secondary can examine data 
and obtain detailed information (e.g., status, partial 
results, etc.) from the bulk memory unit (via a 
separate memory port) or from any PAM. This 
information, when requested, would be transmitted to 
the Primary Computer for presentation to the user. 


Performance 


Development of the MIDAS project was planned 
in three phases: prototype construction of a simplified 
distributed subsystem (to test the basic switching, 
communication, and control concept); development of 
a complete model of a fully operational subsystem 
under control of a Primary (to test performance); and 
then implementation of a full MIDAS architecture 
with multiple subsystems. The project was begun in 
the fall of 1979 and the prototype was operational in 
January 1982. It consisted of 8 memory modules, 
input and output processors, a bulk memory unit, 4 
CPUs, and the high-speed switching hardware 
(Conductor). 


Table 2 
Results of Benchmark Tests on MIDAS Prototype System 
Running Time in Seconds (Ratio) 

                  I/O Limited   Average Mix   CPU Limited 
Single Computer   272 (6.3)     612 (5.5)     2314 (4.9)    (74) 
MIDAS (3)*         49 (1.1)     159 (1.4)      673 (1.4)    (1.4) 
MIDAS (4)*        <43> (1)      112 (1)        472 (1)      (1) 

*Indicates number of parallel CPUs 


Based on the results of software simulations, the 
performance of the system was expected to be de- 
pendent on the relative I/O versus CPU requirements 
of each problem. This is to be expected since the 
system was designed to minimize I/O impact on
calculations. Benchmark testing therefore utilized 
various existing programs with different calculational 
conditions. These programs were written in three 
different languages, with Fortran being most common. 
In order to operate in a parallel environment some 
modifications of the original programs were required. 
The required modifications were not major, and 
considerable effort was made to minimize such 
changes in order to permit accurate benchmark 
testing. The program structures were, therefore,
deliberately not optimized to take full advantage of 
the architecture. 


Table 2 compares the time required to run four 
different problems on a standard ModComp computer 
and on MIDAS, utilizing 3 and 4 ModComp CPUs, 
respectively. The speed increases observed in the 
CPU-limited problem (specifically a Monte Carlo 
fission simulation program) tracked exactly with the 
number of CPUs used, as was expected. The I/O- 
intensive problem, however, ran over six times faster 
on MIDAS (with 4 CPUs) than on the standard 
computer. This relative performance increase is 
attributable both to the existence of independent I/O 
processors within MIDAS and the more efficient 
handling of information transfer (the problem required 
the examination and analysis of approximately 10 
Mbytes of information). The value of 43 seconds,
obtained with the 4-CPU MIDAS prototype, was
subsequently determined to be an artificial limitation
imposed by a problem in an external commercial disc
controller, which was unable to supply the continuous
high-speed data rate MIDAS was capable of accepting.
The accuracy of the calculational results was
independently verified in all cases.
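
The ratios quoted in parentheses in Table 2 can be re-derived from the running times. The following is a minimal sketch, assuming each ratio is the running time divided by the 4-CPU MIDAS time for the same problem. The names `TIMES` and `ratios` are illustrative, and the three timing columns are keyed by position because the header-to-column mapping of the printed table is ambiguous.

```python
# Running times in seconds for the three Table 2 columns whose times survive,
# keyed by column position (assumption: ratios are relative to MIDAS (4)).
TIMES = {
    "Single Computer": (272, 612, 2314),
    "MIDAS (3)": (49, 159, 673),
    "MIDAS (4)": (43, 112, 472),
}


def ratios(times, baseline="MIDAS (4)"):
    """Return each running time divided by the baseline time, to one decimal."""
    base = times[baseline]
    return {
        system: tuple(round(t / b, 1) for t, b in zip(row, base))
        for system, row in times.items()
    }


if __name__ == "__main__":
    for system, row in ratios(TIMES).items():
        print(f"{system:16s}", row)
```

Under this assumption the computed values agree with the printed ratios, including the factor of just over six for the first column on the 4-CPU prototype.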


The second phase of development, the
construction of a complete subsystem, was completed
in February 1983. The layout of this model is shown
in Figure 3. Performance of the system was again
tested with a similar mix of programs and the results
of these tests are shown in Figure 4. The observed
speed enhancement (with respect to a single
computer) is plotted versus the number of CPUs
involved. If processor contention is negligible and
each processor is able to contribute its complete
capability towards the solution of the problem, the
relative performance would simply equal the number
of CPUs employed. For the CPU-bound problems
investigated, the lower curve indicates that this was
indeed the case. The middle line represents a more
typical problem (with modest I/O requirements) and
exhibits a speed enhancement approximately 50%
greater than would be expected from the addition of
single processors. The upper curve, which is a highly
I/O intensive program, exhibits an even greater
increase in relative speed.


The Multiple Processor Array used in this Phase
2 test can, as shown in Figure 3, receive external
data from either the Primary Computer or a
specialized disc system supervised by a Multiported
Programmable Controller (MPC). This high-speed
controller is part of the intelligent mass storage
system scheduled for operation in the next phase of
the project. Since development of this controller is
not complete, data required by the test programs
were supplied from disc drives on the Primary
Computer. The maximum steady state data rate
which the commercial disc controller package could
sustain was approximately 650 KB/sec. In the I/O
intensive test results shown in Figure 4, five CPUs
were more than sufficient to handle the analysis of
this continuous data stream. The further addition of
CPUs therefore resulted in no improvement in the
processing speeds. This temporary restriction in
handling highly I/O intensive problems will be
alleviated when the new MPC system is operational.
A single such controller will feature 3 independent
channels of look-ahead dual-track buffering and will
be expected to handle sustained speeds of
approximately 1.2 Mbytes/sec.


Figure 4. Test results for MIDAS, Phase 2


It should be emphasized that although the Phase 
2 model shown in Figure 3 could run eight different 
programs simultaneously, the test results shown in 
Figure 4 require that the processors work coopera- 
tively on a single problem. Due to problems of com- 
munication, control and memory conflicts, commercial 
multiprocessor computer systems are not able to 
deliver one-to-one or even linear speed increases with 
an increasing number of processors. This is true
even when the CPUs are operating on totally
independent problems. For purposes of comparison the
relative performance of a DEC VAX 11/782
(containing two 11/780 CPUs) is also shown in Figure 4.
According to DEC's published measurements, the
addition of a second CPU results in a 60-80%
increase in performance. Two new systems, the
ELXSI and Denelcor's HEP, are specifically designed
to reduce multiprocessor conflicts.


Table 3 shows timing comparisons for a program
running on both the CDC 7600 and the MIDAS Phase
2 model. The CDC value reflects only the actual
processing time and excludes both system overhead
and I/O time. The MIDAS values are total processing
times, including I/O. The MIDAS results, labeled
"standard code" in Table 3, were obtained by
utilizing as close a representation of the original
CDC program as possible. Although the program
heavily utilizes floating point calculations, it also
requires the I/O transfer of about 560 Kbytes of
information during execution. After the initial
comparison tests, the standard code was modified
slightly to explicitly take advantage of some of the
parallel aspects of the architecture. These minor
changes resulted in a 20% increase in MIDAS
processing speed, as shown in the last column of
Table 3. Further investigation suggests that
additional MIDAS-specific modifications to the program
may effectively double the performance of the program.
Additional information on programming MIDAS in
general, and this problem in particular, is contained
in Reference [5].
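
The Table 3 timings can be turned into speedup figures with a few lines of arithmetic. This is a minimal sketch, not part of the original study: the dictionary and function names are illustrative, and the 1-CPU entry for the enhanced code is omitted because it is not legible in the printed table.

```python
# Execution times in seconds, transcribed from Table 3.
CDC_7600_CPU_SECONDS = 38.5  # CPU seconds only
STANDARD = {1: 447, 2: 223, 3: 147, 4: 111, 5: 89, 6: 74, 7: 63, 8: 55}
# The 1-CPU enhanced-code entry is not legible in the table and is left out.
ENHANCED = {2: 176.0, 3: 117.0, 4: 87.3, 5: 70.1, 6: 58.8, 7: 50.2, 8: 44.3}


def speedup(times, n_cpus, base_cpus=1):
    """Speedup of an n-CPU run relative to the base run of the same code."""
    return times[base_cpus] / times[n_cpus]


def fraction_of_cdc(midas_seconds):
    """MIDAS throughput as a fraction of the CDC 7600 (CPU seconds only)."""
    return CDC_7600_CPU_SECONDS / midas_seconds


if __name__ == "__main__":
    for n in sorted(ENHANCED):
        gain = STANDARD[n] / ENHANCED[n]
        print(f"{n} CPUs: standard speedup {speedup(STANDARD, n):.2f}, "
              f"enhanced/standard gain {gain:.2f}, "
              f"fraction of CDC 7600 {fraction_of_cdc(ENHANCED[n]):.2f}")
```

With eight CPUs the enhanced code comes out at roughly 0.87 of the CDC 7600 figure, bearing in mind that the CDC time excludes I/O and system overhead while the MIDAS times include I/O.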


Future Directions 


Figure 5 illustrates a potential layout for the 
next, or third phase of the MIDAS project. In this 
example, the Primary Computer is controlling five 
processing clusters. Such a structure would contain 
between 55 and 145 processors, depending on the 
implementation plan adopted. The control and com- 
munication between processor clusters adds a new di- 
mension to the operation of the facility. The 
Primary Computer may utilize multiple clusters on 
single problems in an analogous fashion to the way 
the Secondary controls multiple processors. The 
Phase 3 effort will, in addition, require the develop- 
ment of a multi-bussed, parallel-processor mass stor- 
age environment and a high-speed, interactive system. 


Subsequent development will also continue to
increase the processing capability within a cluster.
As illustrated in Figure 1, the existing Multiprocessor
Array has two unused busses on the switch
(Conductor). The number of CPUs within a cluster
could, therefore, be tripled by duplicating the current
8-CPU, time-multiplexed handler on each of these
additional memory busses (the number of memory
modules would also have to be increased).
Alternatively, the addition of multiple vector (array)
processors to one bus is being investigated. Such
units would operate in parallel with all other
elements of the system. They would be independently
programmable and, like other processors, could
receive and transmit information by means of
switchable memory blocks. Communication, frequently
a bottleneck in standard CPU-array processor
configurations, would effectively occur at memory
speeds with both processors free to continue
concurrent operation. The Array Processors could
thus be used to augment the standard processors for
those portions of a problem amenable to the vector
approach.


Table 3
MIDAS Performance Relative to CDC 7600
(Execution Time in Seconds)

                        ----------- MIDAS -----------
# CPUs   CDC 7600 (1)   "Standard" Code (2)   Enhanced Code (3)
1        38.5           447                   555-6
2                       223                   176.
3                       147                   117.
4                       111                   87.3
5                       89                    70.1
6                       74                    58.8
7                       63                    50.2
8                       55                    44.3

1) CPU seconds only
2) Essentially identical to the original CDC code
3) Code modified to utilize some specific MIDAS
   capabilities


Figure 5. Potential layout of a Phase 3 MIDAS
system, containing 5 multi-processor clusters.


Another area of active study will be the dynamic
utilization of special purpose processors. Such
processors could perform in hardware specific
calculations, or algorithms, which are not amenable
to parallel approaches (either vector or
multiprocessor). These hardware modules (ultimately
VLSI processors) could be incorporated on a vacant
memory bus (Figure 1) and accessed from code in a
manner analogous to a hardware subroutine. To
further increase the processing power of a cluster,
the Phase 3 development will also replace existing
CPUs with new commercial processors at least 3 to 4
times as fast. The performance of a number of new
(or planned) CPUs, including micro-processors, is
therefore being investigated.


The main software thrust has been to permit
existing programs to operate in a parallel
environment with minimum code changes. MIDAS is, of
course, different from standard computers and is
capable of handling problems in ways not possible on
other machines. As was indicated previously, even
small structural changes in code can significantly
increase the execution speed on MIDAS [5]. For this
reason considerable effort is underway to investigate
different computational approaches to important
problems and to develop tools and language constructs
which permit users to exploit the full, and currently
untapped, potential of this system.


Summary 


Computers continue to assume an increasingly
important role in scientific research. Traditional
serial computers are, however, approaching the
theoretical limits imposed by heat dissipation and the
speed of signal propagation. In the decade of the
60's computing speeds increased by a factor of about
100, while during the 70's only a factor of
10 increase in speed was realized. In order to
achieve significant improvements in future computer
processing, it is necessary to explore new parallel
architectures. Until recently effort in this direction
has focussed primarily on parallel instruction
execution (SIMD architectures), as in vector machines
(e.g., Cyber 205 or CRAY-1). While such
architectures can provide substantial processing
capability for "vectorizable" programs, many
problems, particularly those dominated by conditional
testing and branching, do not appear amenable to this
approach.


The MIDAS project represents research in
advanced computer architectures and their specific
application in a scientific environment. The
performance of the Phase 2 multiprocessor cluster
was tested on a variety of programs having total
sustained I/O requirements spanning over 2 orders of
magnitude (5 to 1000 KBytes/sec.) for periods of
15-20 minutes. In the worst case, the cluster performed
n times faster for n CPUs. The maximum
performance realized was approximately twice n.
The tests were performed using existing programs and
real data. A CPU-intensive problem run on both the
CDC 7600 and MIDAS indicated that the current
Phase 2 cluster was the equivalent of at least 85%
of the 7600. The Phase 2 cluster required
approximately $360K to develop and about 18
man-years of effort.


The utilization of multiple processors operating
concurrently, such as employed in the MIDAS 
architecture, provides an alternative to achieve 
necessary high-speed computation in the future. The 
MIDAS Project is designed to fill a specific need 
within the scientific community. Experience at other 
research institutions and examination of current 
computer performance suggest that this need cannot 
reasonably be filled by currently available computers 
in any cost-effective manner. 


THE HOMOGENEOUS MULTIPROCESSOR ARCHITECTURE -
STRUCTURE AND PERFORMANCE ANALYSIS


Nikitas Dimopoulos 
Electrical Engineering Department 
Concordia University 
Montreal, Quebec, Canada H3G 1M8 


Abstract -- In this work, we present the structure
of the Homogeneous Multiprocessor. The
Homogeneous Multiprocessor is a tightly coupled
MIMD architecture composed of two parts: the
multiprocessor proper, where each processing
element communicates directly with either one of its
two immediate neighbors through a distributively
controlled switching network, and the H-Network,
which is a high speed (7 Mbytes/sec) local area
network.

A performance analysis of the two components
of the architecture as well as possible
applications of the Multiprocessor are also presented.


I. Introduction


In recent years multiprocessors have become 
important in solving problems where a large amount 
of computation is needed. Several multiprocessors 
have been proposed and built. 

A major architectural issue involved in the 
design of such machines is the availability of 
information paths that would enable the exchange 
of information between processors. Most of the 
existing MIMD designs have opted for a complete 
graph solution incorporating crossbar switching or 
microprogrammed controllers rendering the system 
either expensive or slow. 

The Homogeneous Multiprocessor discussed in 
this work resulted from the realization that in 
many applications (relaxation processes [6], neu- 
ral network simulation [4]) the complete graph 
solution is not a necessary attribute of the 
design, since such problems can be formulated in 
such a way so that each computational subtask 
would require information from only its
neighboring subtasks to complete the computation.
Moreover, such problems would benefit from
architectures that limit the scope of interprocessor
communication, but make the limited communication
paths fast.

In this work we will describe one such multi- 
processor architecture, the Homogeneous Multipro- 
cessor. 


II. Structure


The Homogeneous Multiprocessor system, shown
in Fig. 1, is a tightly coupled MIMD structure
composed of two parts: the Homogeneous
Multiprocessor proper and the H-Network.

The Homogeneous Multiprocessor proper
consists of k processors, k memory modules and k
local buses used by each processor to access its
local memory module. Interbus switches, normally
open, isolate the processor-memory complexes from
each other, thus allowing maximum throughput. The
interbus switches also provide the means of
communication between neighboring processors. Upon a
request from either one of its two neighboring
processors, a switch may close, thus connecting two
neighboring local buses to form an extended bus.
An extended bus is then utilized by the requesting
processor to access the memory module of its
neighbor, and this in turn provides for
interprocessor communications. As may be seen
in Fig. 1, there is no central controller to
control the switches. Rather, associated with
each one of the switches, there exists a switch
controller which decides on the next state of the
switch based on the status of the two neighboring
switches and the presence of a request for the
switch to close.

Thus a switch s_i can exist in only one of
three states, and it receives requests for service
from its two adjacent processors p_i and p_{i+1}. The
next state transition is based on the current
state of the switches s_{i-1}, s_i and s_{i+1}, and it is
computed according to Algorithm 1.2.

The three states in which a switch can exist are
as follows:

- OPEN:   Denoted by O. This state signifies
          that no requests exist or, if a request
          exists, it will not be honoured in the
          immediate future.

- GRAY:   Denoted by G. This signifies that a
          request has been acknowledged and
          service (i.e. switch closure) will be
          granted in the immediate future.

- CLOSED: Denoted by C. This state signifies
          that conditions are compatible for
          establishing an "extended bus".

The operational Algorithm 1.2 that decides
the next transition is as follows:


Algorithm 1.2


For the switch s_i:

1. If no requests exist it becomes Open;
   if a request exists then:

2. If Open it becomes Gray provided that the
   switch to its left (s_{i-1}) is Open. Otherwise
   it remains Open.

3. If Gray it becomes Closed provided that the
   switch to its right (s_{i+1}) is Open.
   Otherwise it remains Gray.

4. If Closed it remains Closed.

5. Switches s_0 and s_{k+1} are always open.


A Pascal description of the above Algorithm 
1.2 is given in Fig. 2. It has been proven [3] 
that this algorithm provides for safeness (i.e. no 
two neighboring switches will close at the same
time) and liveness (i.e. if a switch is requested 
to close, it will do so in the near future). 
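
The synchronous behaviour of Algorithm 1.2 can be explored with a small simulation. The sketch below follows the Pascal description of Fig. 2 (an Open switch whose left neighbor is not Open stays Open). It is an illustration under stated assumptions rather than the author's implementation: the names `State`, `next_state`, `step` and `is_safe` are hypothetical, and all switches are updated together from the previous states, as the two parbegin/parend phases suggest.

```python
from enum import Enum


class State(Enum):
    OPEN = "O"
    GRAY = "G"
    CLOSED = "C"


def next_state(state, request, i):
    """Next state of switch i, computed from the current states only."""
    if not request[i]:
        return State.OPEN
    if state[i] is State.OPEN:
        # Advance to Gray only if the left neighbor is Open.
        return State.GRAY if state[i - 1] is State.OPEN else State.OPEN
    if state[i] is State.GRAY:
        # Close only if the right neighbor is Open.
        return State.CLOSED if state[i + 1] is State.OPEN else State.GRAY
    return State.CLOSED


def step(state, request):
    """Update switches 1..k together; switches 0 and k+1 stay Open."""
    k = len(state) - 2
    inner = [next_state(state, request, i) for i in range(1, k + 1)]
    return [State.OPEN] + inner + [State.OPEN]


def is_safe(state):
    """Safeness: no two neighboring switches are Closed at the same time."""
    return all(
        not (a is State.CLOSED and b is State.CLOSED)
        for a, b in zip(state, state[1:])
    )


if __name__ == "__main__":
    k = 3
    state = [State.OPEN] * (k + 2)
    request = [False] + [True] * k + [False]  # every switch is requested
    for tick in range(1, 5):
        state = step(state, request)
        print(tick, "".join(s.value for s in state), is_safe(state))
```

With every switch requested, only the rightmost switch closes and the others wait in the Gray state until its request is withdrawn, which is the kind of behaviour the safeness property describes.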

The second component of the architecture is 
the H-Network [2]. This is a high speed base-band 
local area network with a structure which resem- 
bles that of the Ethernet [5], yet it utilizes 
separate pathways for data transmission, network 
acquisition, and collision detection. These sepa- 
rate pathways allow the processes of data trans- 
mission and collision detection to proceed in 
parallel while the network acquisition interval is 
reduced to less than 100 ns. This short network 
acquisition interval increases the probability 
that only one station will become master of the 
network at any given time, which results in
increased network utilization.

As shown in Figs. 1 and 3, the H-network
consists of four pathways plus Network Stations
which interface the network to each processor of
the Homogeneous Multiprocessor proper. The
pathways are:

(a) The H-bus. This is the data highway which
    consists, in the present implementation, of
    16 data lines. Each Network Station
    interfaces with the H-bus via an input and an
    output buffer which are implemented by using
    fast FIFO's (128 words x 16 bits/word).
    These buffers are used to capture an incoming
    packet and to hold an out-going packet until
    the station controller achieves mastership of
    the H-Network.

(b) The Access Line. This is a control line and
    it is used to ensure the mastership of the
    network by a single station with high
    probability. A fast test and set module is used
    by the station controllers so that they can
    test and set the condition of the Access line
    in a very short interval of time (~ 100 ns).

(c) The ID line is used to detect possible
    collisions. A station controller, while master of
    the network, transmits, and at the same time
    listens for, its unique identification code over
    the ID line. If more than one station has
    gained mastership of the network their codes
    collide on the ID line. Such a collision can
    be detected by the transmitting stations and
    the transmission aborted.

(d) Timing and control. This group of lines
    facilitates the actual transmission of data
    through the H-bus. The data transmission
    protocol adopted is an asynchronous one with
    handshaking. Such a protocol is immune to
    varying signal delays imposed by the physical
    placement of the modules or by the technology
    used for their implementation.


The basic unit of information transmitted 
over the H-bus is the packet. As can be seen
in Fig. 4, each packet consists of three parts,
namely the header, the body and a checksum. The 
total length of the packet may not exceed, in this 
implementation, 128 words. This length coincides 
with the depth of the FIFO's used in the network 
stations. Three words have been reserved for the 
header. 

The first word carries the origin and
destination of the packet. The first word of the
header is captured by the Temporary Register of
the network stations. This event happens before
the commencement of the transmission of the
remainder of the packet. Thus, the origin and the
destination of any packet are made available to 
all the network stations prior to the packet 
transmission. Based on this information, the 
stations decide whether to capture the incoming 
packet, and communicate their decision (through 
the timing and control lines) to the transmitting 
station which releases the packet to the H-bus at 
high data rates (~ 7 Mbytes/sec). Packets with a 
destination of 00 are accepted by all the sta- 
tions. Such packets are used for network control 
and management. 

The second word of the header contains the 
length of the packet plus control information such 
as type of packet and retransmission number. 
Finally, the third word of the header has been
reserved for use by the higher layers of the
network protocol.
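
The packet layout described above can be sketched as follows. The text fixes only the overall structure (three header words, a body, a checksum, at most 128 sixteen-bit words, and destination 00 accepted by all stations). The bit widths of the source, destination, type, retransmission and length fields, the checksum rule (sum modulo 2^16), and every name in the code are assumptions made purely for illustration.

```python
MAX_PACKET_WORDS = 128   # matches the depth of the station FIFOs
HEADER_WORDS = 3
BROADCAST = 0x00         # destination accepted by all stations


def build_packet(source, destination, body, packet_type=0, retransmission=0):
    """Assemble a packet as a list of 16-bit words (hypothetical field layout)."""
    length = HEADER_WORDS + len(body) + 1  # header + body + checksum
    if length > MAX_PACKET_WORDS:
        raise ValueError("packet exceeds 128 words")
    # Word 1: origin and destination (assumed 8 bits each).
    word1 = ((source & 0xFF) << 8) | (destination & 0xFF)
    # Word 2: length plus control information (assumed 4-bit type,
    # 4-bit retransmission number, 8-bit length).
    word2 = ((packet_type & 0xF) << 12) | ((retransmission & 0xF) << 8) | length
    # Word 3: reserved for the higher protocol layers.
    words = [word1, word2, 0] + [w & 0xFFFF for w in body]
    checksum = sum(words) & 0xFFFF  # assumed checksum rule
    return words + [checksum]


def accepts(station_id, packet):
    """A station captures a packet addressed to it or to destination 00."""
    destination = packet[0] & 0xFF
    return destination in (BROADCAST, station_id)


if __name__ == "__main__":
    packet = build_packet(source=3, destination=5, body=[1, 2, 3])
    print([hex(w) for w in packet], accepts(5, packet), accepts(4, packet))
```

Because the first header word alone determines acceptance, a station can decide whether to capture the packet before the remaining words are released onto the H-bus.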

III. Performance 

The two components of the architecture have 
been analyzed separately. 

The Homogeneous Multiprocessor proper has
been simulated and the average memory access time
A for neighboring memory modules is given as a
function of the average interarrival interval $\lambda$
of requests. This function, for a multiprocessor
composed of 20 processor-memory complexes with a
memory access time of 600 nsec, is given in Fig. 5.

For the H-Network we have assumed a 1-persistent
CSMA-CD, where $\tau$ is the packet transmission
interval, $\delta$ is the collision detection interval
($\delta = 10$ μsec) and $a$ is the window of vulnerability,
during which one or more H-stations may acquire
the network ($a \approx 100$ nsec). We have also assumed N
network stations, each one of which attempts to
gain mastership of the network with a period T
(T = 3 μsec).

The network acquisition, packet transmission
and collision detection cycles are outlined in
Fig. 6.

The N stations, which are continuously vying
for the acquisition of the network, are modeled as
a Poisson process with parameter $G_N = N/T$.
Therefore, the utilization factor is calculated as

$$S = \frac{U}{B + I},$$

where U is the average period of successful
transmission, B is the average busy period
(collision or successful transmission) and
I is the average idle period.

The quantities U, B and I are calculated as
follows:

$$U = \tau \cdot P(\text{successful transmission}) = \tau \cdot P(\text{no requests in interval } a) = \tau e^{-Na/T}$$

$$B = a + \tau \, P(\text{successful}) + \delta \, (1 - P(\text{successful})) = a + \tau e^{-Na/T} + \delta \left(1 - e^{-Na/T}\right)$$

and finally,

$$I = \frac{1}{G_N} = \frac{T}{N}.$$

Therefore,

$$S = \frac{\tau e^{-Na/T}}{a + \tau e^{-Na/T} + \delta \left(1 - e^{-Na/T}\right) + \frac{T}{N}}$$

A plot of S as a function of N for various
$\tau$ (in μsec) is given in Fig. 7.
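
The utilization factor can be evaluated numerically from the quantities defined in this section. The following is a minimal sketch assuming the stated parameter values (a = 100 nsec, a collision detection interval of 10 μsec, T = 3 μsec), with all times expressed in microseconds. The function and parameter names are illustrative, and the packet transmission intervals in the usage loop are arbitrary examples, not values taken from Fig. 7.

```python
import math


def utilization(n_stations, tau_us, a_us=0.1, delta_us=10.0, period_us=3.0):
    """Utilization S = U / (B + I) of the H-Network model (times in microseconds)."""
    # Probability that no other request arrives within the vulnerability window a,
    # for a Poisson process with rate G_N = N / T.
    p_success = math.exp(-n_stations * a_us / period_us)
    useful = tau_us * p_success                                          # U
    busy = a_us + tau_us * p_success + delta_us * (1.0 - p_success)      # B
    idle = period_us / n_stations                                        # I = 1 / G_N
    return useful / (busy + idle)


if __name__ == "__main__":
    for tau in (10.0, 50.0, 100.0):  # example packet transmission intervals
        row = [round(utilization(n, tau), 3) for n in (1, 5, 10, 20, 50)]
        print(f"tau = {tau:5.1f} usec:", row)
```

In this model longer packets raise the utilization for a fixed number of stations, since the fixed acquisition, collision and idle overheads are amortized over a longer successful transmission.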


IV. Discussion 


The two components of the Architecture (i.e.
the Homogeneous Multiprocessor proper and the H- 
Network) play a different role in the function of 
the Multiprocessor. 

The Homogeneous Multiprocessor proper is 
particularly suited for problems that can be 
modeled by a set of parallel processes operating 
independently of each other and occasionally 
exchanging information or control with their imme- 
diate neighbors. 

In general, the solution for such problems is
calculated as:

$$x^{t+1} = Q^t x^t$$

where

$$Q^t = [q_1^t, \ldots, q_k^t]$$

is the operator describing the problem. Examples
of such applications may be drawn from image pro- 
cessing through relaxation [6], digital filtering 
[1] and neural networks [4]. 

The H-Network will serve primarily as the 
main communications link of the Homogeneous Multi- 
processor proper and its environment. It will 
carry user and file traffic in cooperation with 
Front End and Back End processors. 

Also, the H-Network makes it possible for
distant processors in the Multiprocessor proper to
communicate directly without resorting to
time-consuming hops through intermediate processors.


Figure 3. The Homogeneous Multiprocessor
M: Memory   P: Processor   S: Bus Switch
FE: Front End   BE: Back End   SC: Switch Controller
T: Terminal   b: Local bus   HS: Network station
MS: Mass storage   R/G: Bus request/grant


In order to enhance reliability and to pro- 
vide for overall control of the structure, one of 
the processors of the multiprocessor proper 
together with its two immediate neighbors will 
serve as master of the structure. The master will 
share its data base with its neighbors. The mas- 
ter and its neighbors will repeatedly check for 
integrity of the structure. In the event that 
the master malfunctions, then one of the neighbors 
will take over. This becomes possible since the 
data base of the master is shared by its two 
neighbors. 

The designation of the master comes about 
through the H-Network. Upon power up or reset the 
processor whose network station first achieves 
mastership of the H-Network becomes master of the 
structure. 

    {Synchronization Algorithm 1.2 for the network of switches.
     k is the number of switches in the network.}
    program Algorithm;

    type states = (open, gray, closed);

    var
      state: array[0..k+1] of states;    {array of the switch states}
      nxtstate: array[1..k] of states;   {next state array}
      request: array[1..k] of Boolean;

    procedure newstate(i: integer);
    begin
      if (request[i] = false) then nxtstate[i] := open else
      begin
        case state[i] of
          open:   if (state[i-1] = open) then nxtstate[i] := gray
                  else nxtstate[i] := open;
          gray:   if (state[i+1] = open) then nxtstate[i] := closed
                  else nxtstate[i] := gray;
          closed: nxtstate[i] := closed;
        end {case}
      end
    end; {newstate}

    begin
      state[0] := open; state[k+1] := open;
      while true do
      begin
        parbegin
          newstate(1);
          newstate(2);
          {...}
          newstate(k);
        parend;
        parbegin
          state[1] := nxtstate[1];
          state[2] := nxtstate[2];
          {...}
          state[k] := nxtstate[k];
        parend;
      end {while}
    end. {Algorithm}

Figure 2. The Synchronization Algorithm 1.2
for the Network of the Switches


Figure 4. Packet (header and body)
S: Source   D: Destination   L: Length
C: Control   R: Reserved   CS: Checksum


Figure 6. H-Network acquisition, transmission and collision
cycles


Cedar - A Large Scale Multiprocessor


Abstract


This paper presents an overview of Cedar, a large 
scale multiprocessor being designed at the University of 
Illinois. This machine is designed to accommodate 
several thousand high performance processors which are 
capable of working together on a single job, or they can 
be partitioned into groups of processors where each 
group of one or more processors can work on separate 
jobs. Various aspects of the machine are described 
including the control methodology, communication net- 
work, optimizing compiler, and plans for construction. 


1. Motivation 


The primary goal of the Cedar project is to demon- 
strate that supercomputers of the future can exhibit 
general-purpose behavior and be easy to use. The Cedar 
project is based on five key developments which have 
reached fruition in the past year and taken together offer 
a comprehensive solution to these problems. 


(1) The development of VLSI components makes 
large memories and small, fast processors 
available at low cost. Thus, highly parallel (e.g., 
1024 processors) systems are not ruled out by cost 
or physical volume considerations as they have been 
in the past. Particularly important are the 32-bit 
2.5 megaflop chips or chip-sets developed in the past 
year [WaMc82]. Thus, basic hardware building 
blocks will be available off-the-shelf in the next few 
years. 


(2) Given the hardware components for a highly
parallel system, accessing a parallel shared memory and
moving data between memories and processors has
been a traditional architectural stumbling block.
Many systems have been built that have severe
constraints in the memory (e.g., access to columns only)
or interconnection network (e.g., nearest neighbors
only). Based on many years of work, it is
possible to have a shared memory and switch
design which will provide high bandwidth
over a wide range of computations and
applications areas.


This work was supported in part by the National Science
Foundation under Grants No. US NSF MCS81-00512 and
US NSF MCS80-01561, the US Department of Energy under
Contract No. US DOE DE AC02-81ER 10822, and by the Department


(3) Compilation for parallel, pipeline, and
multiprocessor systems has been another serious traditional
problem. The Parafrase project has
demonstrated that by restructuring ordinary
programs these supercomputer architectures
can be exploited effectively. It has also been
shown that Parafrase can restructure programs to
effectively exploit various levels of a memory
hierarchy. An important consequence is that a compiler
can be used to manage caches in a multiprocessor
and thus avoid cache coherency problems.


(4) The control of a highly parallel system is another
problem of long-standing concern and controversy. 
It is probably the most controversial of the five 
topics listed here, mainly because it seems to be the 
least amenable to rigorous analysis. From an 
abstract viewpoint, the traditional dataflow 
approach seems best because control is distributed 
out to the level of operations on scalar operands. In 
practice, it seems that dealing with this low level of 
granularity has many weaknesses. By using a 
hierarchy of control, we have found that 
dataflow principles can be used at a high level 
(macro-dataflow), thus avoiding some of the 
problems with traditional dataflow methods. 
We have also demonstrated that a compiler can res- 
tructure programs written in ordinary programming 
languages to run well on such a system. 


(5) Algorithms for systems with concurrency have been
studied for a number of years. Many successes have
been achieved in exploiting the array parallelism of
various pipeline and parallel machines. But there
have been a number of difficulties as well. It has
long been realized that some of these difficulties
should be surmountable using a multiprocessor
because the parallelism in such a machine is not as
rigid as in array-type machines. Recent work in
numerical algorithms seems to indicate great
promise in exploiting multiprocessors without
the penalty of high synchronization overheads
which has proved fatal in some earlier studies.
Furthermore, nonnumerical algorithms have been 
developed at a rapidly increasing rate in the past 
few years. These can generally use a multiprocessor 
more efficiently than a vector machine, particularly 
in cases where the data is less well structured. Our 
group has been active in developing both numerical 
and nonnumerical algorithms. 


To reach the goal stated in the opening paragraph, we 
believe that a two-phase approach is necessary. The first 
phase is to demonstrate a working prototype system, 
complete with software and algorithms. The second 
phase would include the participation of an industrial 
partner (one or more) to produce a large scale version of 
the prototype system called the production system. 
Thus, the prototype design must include details of scal- 
ing the prototype up to a larger, faster production sys- 
tem. 


Our goal for the prototype is to achieve Cray-1 
speeds for programs written in high level languages and 
automatically restructured via a preprocessing compiler. 
We would expect to achieve ten to twenty megaflops for 
a much wider class of computations than can be handled 
by the Cray-1 or Cyber 205. This assumes a four clus- 
ter, 32-processor prototype where each processor delivers 
2.9 megaflops. 


The production system will use processors that 
deliver over 10 megaflops, so a 1024 processor system 
should realistically deliver (through a compiler) several 
gigaflops by the late 1980s. Actual speeds might be 
higher if (as we expect) our ideas scale up to more pro- 
cessors, if higher speed VLSI technology is available, and 
if better algorithms and compilers emerge to exploit the 
system. 


An integral part of the design for the prototype and 
final system is to allow multiprogramming. Thus, the 
machine may be subdivided and used to run a number of 
jobs, with clusters of eight processors, or even a single 
processor being used for the smallest jobs. 


2. The Cedar Architecture 


In order to integrate the discussion, we show in Fig. 
1 an overall system diagram. More details of our prelim- 
inary view of the system are discussed in [GLPV83]. 


2.1. Processor Cluster. A Processor Cluster (PC) is 
the smallest execution unit in the Cedar machine. A 
chunk of program called a Compound Function can be 
assigned to one or more PCs. 


A PC consists of n processors, n local memories, and 
a high speed switching network that allows each processor 
access to any of the local memories. Each processor 
can also access its own local memory directly without 
going through the switch. In this way, extra delay is 
incurred only when the data is not in its own local 
memory. Furthermore, each processor can directly 
access global memory for data that is not in local 
memory. Our compiler is targeted at exploiting this 
hierarchy of memory access speeds. 


Figure 1. Overall system diagram. 


Each processor consists of a floating-point arith- 
metic unit, address generation unit, and Processor Con- 
trol Unit (PCU), with program memory. There are no 
programmer accessible data registers in the processor. 
However, the local memory is dynamically partitionable 
into pseudo-vector registers of different sizes, and so it 
serves really as a large set of general-purpose registers. 
There are two reasons for this type of cluster organiza- 
tion. Firstly, it simplifies the compiler design. Secondly, 
there is no need for general-purpose registers since off- 
the-shelf floating-point arithmetic is an order of magni- 
tude slower than medium size static memories (500 ns vs. 
50 ns). Each local memory has its own global memory 
access unit that allows movement of data between global 
and local memories to proceed concurrently with the 
computation. 

The entire PC is controlled by the Cluster Control 
Unit (CCU), which mostly serves as a synchronization 
unit that starts all processors when the data is moved 
from global memory to local memory and signals the 
Global Control Unit (GCU) when a compound function 
execution is finished. 


In this paper we discuss two different machine sizes: 
the prototype and the production machine. The prototype 
machine is a 4 cluster (8 processors per cluster) machine 
built for the purpose of debugging architectural and 
software concepts and justifying performance estimates 
for a broadly chosen set of applications. The production 
machine is an architecturally and technologically upward 
scalable, 64-128 cluster (8-16 processors per cluster), 
high performance supercomputer. 


To obtain a short design time, we will use off-the-shelf 
components, standard memory chips, and gate arrays for 
the prototype machine, while the production machine will 
use custom VLSI parts and high density packaging 
technology. 


Communication between disks, etc., and global 
memory will be through a special I/O cluster. An I/O 
cluster is equivalent to a PC except for the processors 
themselves. Instead of the usual processors, the I/O 
cluster will have communication processors. These in 
turn will connect to Extended Storage (solid state disks) 
which serves as a buffer between Cedar and the support 
machines (e.g., VAX) which will provide access to disks, 
terminals, etc. 


2.2. Global Network 


Large scale multiprocessors require access to a 
shared memory system and convenient interprocessor 
communication. Early parallel computers tended to be 
mesh-connected—that is, access to neighboring proces- 
sors and memories was fast, but more global 
communication/access took proportionally more time. 
Vast amounts of manpower were expended devising spe- 
cial algorithms which could execute in this type of 
environment. (Pipeline processors are not immune to 
this problem: the performance degradations due to 
non-unit vector strides or irregular addressing patterns 
are generally recognized problems.) 


Our network is based on an extension of the omega 
network [Lawr75] and is similar in concept to the omega 
network designed for the Burroughs FMP machine 
[Burr79], [BaLu81]. That network was nominally 
1024x1024, and was a circuit-switching network where 
the data path at each node was 11 bits wide. They 
estimated that the minimum time required to set up a 
connection would be 120 ns. 


Our initial design differs in several respects from 
the FMP design. It is based on the use of 8x8 switches 
located on 160 pin boards, rather than 2x2 switches. 
Taking into account expected delays due to conflicts, 
time multiplexing of 120-bit packets, memory access, 
and return transmission, we estimate an expected delay 
time of less than 2 μs/1024 words between processor and 
memory. (Using the same techniques we can design net- 
works to provide average global memory access in as lit- 
tle as 500 ns, but these designs would require as many as 
8 boards per processor.) 


An example of one of these networks connecting 16 
processors to 8 memories is shown in Fig. 2. This example 
uses 4x4 switches, but illustrates the principles we 
will use in constructing the 1024 port global network. 
We have discovered ways to add redundancy in larger 
networks that allow us to use this redundancy both for 
conflict avoidance and fault tolerance [Padm83]. Notice 
that unlike the omega network, this network allows more 
than one path between any processor port and any output 
port. This path redundancy provides both fault 
tolerance and conflict avoidance. Thus, from every 
switch (except the last) there are at least two valid 
paths. If one of these is blocked, either by another 
message or by a failure, a connection can still be made 
via an alternate path. (A total blockage can exist if all 
alternate paths are blocked by conflicts with other 
messages and/or faults. However, analytic and simulation 
results indicate that the probability of conflicts is 
significantly lower with the redundant paths than 
without them, and that the probability of there being 
enough faults to block a message is so small that the 
mean time between fault-blocked messages is on the 
order of one year even for the production machine.) The 
control logic which allows this conflict/fault avoidance is 
distributed throughout the network and is not very 
different from the classical omega control algorithm. 


Figure 2. Example of a global network connecting 
16 processors with eight memories. Notice the 
redundant paths from processor 4 to memory 5. 
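
The classical omega control algorithm mentioned above can be sketched as destination-tag routing. The sketch below is only an illustration under simplifying assumptions: it models a plain omega network of 2x2 switches with a single path per source/destination pair, not the 8x8 switches or the redundant paths of the Cedar design, and the function name `omega_route` is hypothetical. 

```python
def omega_route(src: int, dst: int, n_bits: int) -> list[int]:
    """Return the line number a message occupies after each of n_bits stages.

    Assumes a classical omega network with 2**n_bits ports built from
    2x2 switches (no redundant paths, no conflicts modeled).
    """
    mask = (1 << n_bits) - 1
    pos = src
    path = []
    for stage in range(n_bits):
        # Perfect shuffle: rotate the n-bit line number left by one bit.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask
        # Exchange: the switch output is selected by the next
        # destination bit, most significant bit first.
        bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


# Route from port 5 to port 2 in an 8-port network.
print(omega_route(5, 2, 3))
```

Each switch needs only one bit of the destination address to make its decision, which is why the control can be distributed throughout the network. 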


2.3. Memory System 


The overall memory system has a great deal of 
structure to it, but the user need not concern himself 
with anything but the global shared memory. However, 
the fast local memories present in the design can be used 
to mask the approximately 2 μs access time to global 
memory. Each cluster of eight processors contains eight 
16K local memories. A given processor can access its own 
local memory module directly, or any local memory in its 
cluster through the cluster network. 


User transparent access to these local memories will 
be provided in several ways. First, program code can be 
moved from global to local memories in large blocks by 
the cluster and global control units. Time required for 
these transfers will be masked by computation. Second, 
the optimizing compiler will generate code to cause 
movement of blocks of certain data between global and 
local memory. Third, automatic caching hardware 
(using the local memories) will be available for certain 
data where the compiler cannot determine a priori the 
details of the access patterns but where freedom from 
cache coherency problems can be certified. 


All levels of memory include operand level syn- 
chronization facilities (similar to the full/empty bit of 
the Denelcor HEP [Smit82]), and the global shared 
memory includes the (programmer) option of virtual 
memory. 


Figure 3 shows a programmers’ view of the memory 
system. Both the cluster and local memories include 
cache space (which is physically implemented in the 
local 16K memories) for global memory accesses. This 
cache is different from most cache memory schemes in 
that not all global memory accesses are cached—only 
those predetermined by the programmer or compiler. In 
this way, we avoid the cache consistency problems which 
plague most multiprocessor cache designs. Thus, only 
read-only data (or data that is determined by the com- 
piler to be read-only during a short phase of a program) 
or data that is only written by a single processor (private 
data) is cached. 


Thus a user need only be concerned with a single 
uniform globally shared memory, and he could quickly 
design a program to execute from this memory. When 
he is satisfied with his results, he can use the optimizing 
compiler to improve his performance by making better 
use of the entire memory hierarchy and by utilizing more 
parallelism. 


Figure 3. Programmers’ view of the memory system. 


2.4. Global Control Unit. The execution of a pro- 
gram is limited by the parallelism exhibited by the con- 
trol mechanism. In a von Neumann machine, the paral- 
lelism is limited by a serial control mechanism in which 
each statement is executed separately in the order 
specified by the program. 


The execution speed can be increased by using 
parallel control flow or dataflow mechanisms [TrBH82]. 
Each of these mechanisms tries to execute all independent 
operations in parallel, where the operation is a typical 
arithmetic operation (e.g., addition, multiplication, 
etc.) or control operation (e.g., decision). However, the 
number of resources (e.g., operational units) in the 
machine is limited and sometimes not all independent 
operations can be executed in parallel. Therefore, the 
resources must be allocated and deallocated in the order 
specified by the computation. The price paid for 
parallelism is in the form of extra time or hardware 
needed to allocate operational units to instructions and 
to keep track of the execution order, the process we call 
scheduling. Proposed dataflow architectures are very 
inefficient on regular structures because of the fine 
granularity of their operations. When data is structured 
(vectors, matrices, records), the control and dataflow is 
very regular and predictable and there is no need to pay 
high overhead for scheduling. 


In our system, we adapt to the granularity of the 
data structure. We treat large structures (arrays) as one 
object. We reduce scheduling overhead by combining 
together as many scalar operations as possible, and 
executing them as one object [Corn81]. In our machine, each 
Processor Cluster (PC) can be considered as an execution 
unit of a macro-dataflow machine. Each PC executes a 
chunk of the original program called a compound func- 
tion (CF). 

From the GCU point of view, a program is a 
directed graph called a flow graph. The nodes of this 
graph are compound functions and the arcs define the 
execution order for the compound functions of a pro- 
gram. 


The nodes in our graph can be divided into two 
groups: Computational (CPF) and Control (CTF). All 
CTFs are executed in the GCU, and all the CPFs are 
done by clusters. All CPFs have one predecessor and 
one successor. CTFs are used to specify multiple control 
paths, conditional or unconditional. 


The compound function graph is executed by the 
GCU. Each node requires two different types of action: 


(1) Computation of the original part of the program 
specified in the CF which may be done by the GCU 
itself or allocated to PCs. The latter case is for 
CPFs, and it requires their scheduling and prepara- 
tion. The CTFs either do not have this part at all, 
or perform computation related to control. 


(2) Graph update after the executable part of a node is 
done (if it had one). Successors of each node are 
updated and checked for readiness. The updating 
consists of recording that the predecessor node was 
executed. A node is ready when all its predecessors 
in the graph are done. When a node is finished, the 
predecessor information is reinitialized for the next 
execution of the node (if it is a cycle, for instance). 
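
The graph update rule in (2) can be illustrated with a small sketch. It is not the GCU implementation: it assumes an acyclic flow graph given as a dictionary of successor lists (every node must appear as a key), it runs the nodes one at a time instead of dispatching them to clusters, and the names `run_flow_graph`, `successors`, and `entry` are hypothetical. 

```python
from collections import deque


def run_flow_graph(successors: dict[str, list[str]], entry: str) -> list[str]:
    """Execute an acyclic compound-function graph once; return the order."""
    # Count the predecessors of every node.
    initial = {node: 0 for node in successors}
    for succs in successors.values():
        for s in succs:
            initial[s] += 1
    pending = dict(initial)

    ready = deque([entry])
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)               # stands in for executing the CF
        pending[node] = initial[node]    # reinitialize for a later execution
        for s in successors[node]:
            pending[s] -= 1              # record that a predecessor is done
            if pending[s] == 0:          # all predecessors done: node is ready
                ready.append(s)
    return order


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(run_flow_graph(graph, "A"))
```

Node D becomes ready only after both B and C have finished, which is the readiness test described in (2). 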


The second problem of dataflow architectures is 
storage allocation, deallocation and movement of data, 
resulting in slow data access. In our machine, data is 
stored permanently in global memory and it can be 
shared there by all PCs. The data is moved into the 
assigned PC before the execution of a CF and later 
stored back to global memory. In this way, the move- 
ment of data is minimized while the order and locality of 
data is preserved. Thus, the macro-dataflow architecture 
combines the control mechanism of dataflow architectures 
and storage management of the von Neumann machine. 


3. Software 


The primary language for Cedar will be Fortran 
(although we expect several other languages to become 
available as well). In Fortran, users will have a choice of 
writing programs directly in an extended Fortran (based 
on the ANSI 8X standard), or of using their old pro- 
grams as is. The powerful restructuring capabilities of 
Parafrase [KuPa79] will usually be brought to bear on 
programs written in serial Fortran and may also (though 
not necessarily) be applied to programs written in 
extended (parallel) Fortran. Since the Parafrase system 
operates in a source-to-source manner, the user who used 
Parafrase can then choose to maintain his original code 
or the new, restructured version (thus obviating the need 
for further restructuring) [KPSW82]. 


The compiler will provide the Cedar system 
with code that may be regarded as a dependence 
graph containing several types of nodes called 
compound functions (CFs). This macro-dataflow 
graph is presented to the global control unit (GCU) 
which oversees its execution. Some CFs are control 
functions executed in the GCU itself, while other nodes 
are computational and are passed down to clusters of 
processors [GaKP81]. 


Four important kinds of parallelism may be distinguished 
in the Cedar macro-dataflow code. The first 
is parallelism between CFs themselves. This includes 
executing some CFs on the GCU and some in processor 
clusters, as well as executing several CFs at once in 
different processor clusters. 


The second kind of parallelism is in the loop control 
of the computational CFs. For example, all iterations of 
a loop may be executable at once and so each iteration 
can be assigned to a different processor. Of course, the 
GCU may fold a computation onto a limited number of 
processors and each processor will then do a number of 
iterations ([KuPa79], [PaKL80]). 
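
Folding a fully parallel loop onto a limited number of processors can be sketched as follows. The source does not say how iterations are distributed, so the cyclic (round-robin) assignment here is purely an assumption for illustration, and `fold_iterations` is a hypothetical name. 

```python
def fold_iterations(n_iterations: int, n_processors: int) -> list[list[int]]:
    """Assign independent loop iterations to processors in cyclic order.

    Returns one list of iteration indices per processor that receives work.
    """
    used = min(n_iterations, n_processors)
    return [list(range(p, n_iterations, used)) for p in range(used)]


# Ten independent iterations folded onto four processors.
print(fold_iterations(10, 4))
```

When there are at least as many processors as iterations, each iteration gets its own processor and the remaining processors stay idle. 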


The third is a kind of pipelining effect achieved by 
moving data from global to local memories before it is 
needed for the computation of a CF. We can prefetch 
data for iteration i+1 while computing iteration i, for 
example, or we can prefetch larger blocks. Experimental 
evidence shows that this approach will be effective in 
exploiting the local memories in clusters. This software 
cache management only works effectively for data that 
can be guaranteed (by the compiler) to be written at 
most once in a given phase of a computation. Other- 
wise, cache coherency problems can develop, and our 
solution to this is to force any nonsafe code to execute 
directly from the global shared memory. This will cause 
some speed decrease (by a constant factor) and should be 
a rather rare event in any case. The global memory has 
of course only one copy of the data, and hardware will 
ensure that the correct value is stored. 


The fourth kind of parallelism will be exploited 
mainly for phases of computations in which there is less 
parallelism than processors. This involves spreading 
expressions across more than one processor for execution. 
For example, if a loop of 50 iterations could be run as 50 
independent iterations, but our machine had 100 proces- 
sors available, two processors could be used for each 
iteration. This code spreading is entirely within indivi- 
dual compound functions and may involve executing 
independent assignment statements in distinct processors 
or even spreading single expressions over two or more 
processors. Experiments to date show that spreading 
can be very effective in some cases but it is not a first 
priority technique. 


Some standard operating system functions will be 
handled by our hardware, e.g., task scheduling in the 
GCU. The I/O clusters will handle some of the activities 
that are traditionally at the interface between the com- 
piler, OS, and I/O channels. In particular, we plan to 
have the I/O clusters execute I/O statements and do for- 
mat conversions. They will also handle page faults 
between the global memory and disk system. We also 
plan to attach front-end processors to the I/O clusters. 


A front-end processor will provide various user ser- 
vices. We would expect it to be a network node in any 
installation and in the Department of Computer Science 
at the University of Illinois we will attach it to a 
VAX/Ethernet network within the department. The 
point is that users should be able to access the system 
through an interface with which they are familiar and 
happy (VMS, UNIX, NOS, or whatever). Thus, a user 
would submit a job through a front-end processor which 
sends it to an I/O cluster, which in turn can initiate I/O 
directly or begin execution through the GCU. Results 
will be returned through the I/O cluster to the front-end 
processor for output, graphics display, etc. In this way, 
we hope to make the architectural details of the Cedar 
system as invisible to the user as possible. 


4. Summary 


The Cedar architecture nicely integrates the five 
key developments sketched in the introduction. We 
believe that the Cedar system will deliver high performance 
over a much wider range of applications and algorithms 
than today’s supercomputers can handle. 
Because the Cedar clock speed is slow relative to such 
systems, the complexities of building and manufacturing 
this system are substantially reduced. Due to the ever- 
decreasing costs of integrated circuits and the relative 
ease with which the Cedar design can be partitioned, we 
feel that the monetary cost per megaflop will be much 
lower than could be achieved by attempting to push 
today’s pipelined supercomputers to higher speeds. 


VECTOR OPTIMIZATION ON THE CYBER 205 


CLIFFORD N. ARNOLD 

ARDEN HILLS, MINNESOTA 


ABSTRACT. Vector optimization is defined as 
generating the best object code for a given 
vector computation for a given machine. In this 
paper an analytical performance model is 
developed for both scalar and vector source code 
as executed on the Control Data CYBER 205. The 
accuracy of these models is typically within 30% 
for scalar code and within 10% for vector code. 
If the compiler can generate more than one 
version of code for a given parallel 
computation, then performance estimates from 
these models can be used to choose which version 
should be executed. Sixteen FORTRAN kernels 
with two or more versions of source code were 
used as a benchmark to test this method. 
Thirteen kernels were "vector optimized" 
correctly. The three kernels not properly 
optimized had an average performance penalty of 
17%. Using the set of kernels as a benchmark, 
vector optimization improved its performance by 
more than a factor of four, and obtained 98% of 
the improvement possible had all kernels been 
correctly optimized. 


I. INTRODUCTION 

Automatic vectorization is the process by which 
a serial or scalar version of source code is 
analyzed by a piece of software to discover the 
code’s inherent parallelism. The result is a 
transformation of the original code which 
expresses the parallelism discovered by the 
analysis. This can take many forms, for example 
rewritten source code using a parallel dialect, 
machine dependent object code, and many 
expressions in a variety of languages in 
between. A lot of work has gone into automatic 
vectorization software, both in the computer 
industry and in academia. Performance analysis 
of three such software packages is reviewed in 
[1]. There the concept of vector optimization 
was introduced: "The goal of vector optimization 
is to generate the best object code for a given 
vector computation (for a given computer)." 

Little study has been done in this area, which 
is unfortunate. As users of vector computers in 
the field know all too well, code vectorization 
to a higher level of parallelism can in some 
cases slow the computation down. 


This paper addresses the following question. As 
information pertaining to the vectorizability of 
a kernel is increased, will a kernel always run 
at least at the same speed as before, with the 
possibility of being sped up many times? 
Clearly the scalar (original) version of the 
code can be used, thus guaranteeing the same 
execution speed as before vectorization. To 
make vectorization analysis worth the trouble, a 
vector optimization tool needs to determine 
correctly when to choose the scalar version of 
the code and when to choose the vector version. 
In many cases more than one vector version is 
possible, and in such cases a decision among 
these versions also needs to be made. Figure 1 
below illustrates the issue. Three different 
possible code solutions are depicted with their 
performance assumed to be a function of some 
data dependent parameter set. 


[Figure 1: performance plotted against a data 
dependent parameter set (e.g., vector length, store 
density) for three code solutions; Solution 1 is the 
original scalar version.] 

FIGURE 1. Performance Profile of Three Code 
Versions of the Same Kernel 


Vector optimization would ideally choose the 
bold face line. If it did not fully succeed in 
that, it would still avoid the shaded areas. 
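
The selection step can be sketched as picking, for a given value of the data dependent parameter, the version with the smallest predicted time. This is a minimal illustration, not the author's tool: it assumes every version is timed by a linear model (a start-up time plus a per-element time multiplied by the vector length n), the coefficients in the example are illustrative, and the names `predicted_time` and `choose_version` are hypothetical. 

```python
def predicted_time(startup: float, per_element: float, n: int) -> float:
    """Linear timing estimate in microseconds for a problem of length n."""
    return startup + per_element * n


def choose_version(versions: dict[str, tuple[float, float]], n: int) -> str:
    """Return the name of the version with the lowest predicted time."""
    return min(versions, key=lambda name: predicted_time(*versions[name], n))


# Illustrative coefficients: (start-up, per-element) in microseconds.
versions = {"scalar": (0.25, 0.40), "vector": (1.00, 0.010)}
for n in (1, 10, 1000):
    print(n, choose_version(versions, n))
```

With these illustrative numbers the scalar version wins only for very short vectors, which corresponds to following the upper envelope of the curves in Figure 1. 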

I have applied a simple but surprisingly 
successful model to estimate the timings of 
different solutions based on their respective 
source code versions alone. These estimates are 
then used to make the optimization decisions. 
In this paper I describe the timing model for 
vector performance (Section II) and scalar 
performance (Section III) for the CYBER 205. 
Validation of the model against 16 kernels from 
the Livermore Loops [3] as run through several 
vectorizers (see [1]) is described in Section 
IV. Results are discussed in Section V. 


II. MODELING VECTOR PERFORMANCE OF THE CYBER 205 


Assume the time to execute a vector operation is 
a linear function of the vector length n: 

    $t = S + R\,n$    (1) 

Equation (1) is rewritten by Hockney and 
Jesshope [2] as: 

    $t = (n + n_{1/2}) / r_\infty$    (2) 

where $r_\infty$ is the asymptotic performance 
(MFLOPS) at infinite vector length and $n_{1/2}$ 
is the vector length required to achieve half 
the asymptotic performance. Note that t is then 
in microseconds. The coefficients in Equation 
(1) are defined in terms of $r_\infty$ and $n_{1/2}$ 
as noted below: 

    $R = 1/r_\infty$ (microsec/result) 
                                                (3) 
    $S = n_{1/2}/r_\infty$ (microsec) 

R and S were measured for 34 vector instructions 
using vector lengths ranging from 2 to 8192 in 
powers of 2. The timings proved to follow the 
model of Equation (1) very well. The results 
are listed in Table 1. The data for R is 
significant to two digits and for S to three 
digits, with a few exceptions shown in the 
list. Typical values for R range from 0.01 to 
0.03 microsec/result, and for S range from 1 to 
3 microsec ($n_{1/2} \approx 100$). 


Test kernels with several vector operations had 
timings entirely consistent with the timings of 
the individual vector operations added 
together. This is predicted by the linear model. 
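
The linear model and its additivity can be written out as a short sketch. The coefficients in the example are the ADD and MULTIPLY (floating point) entries of Table 1; the function names and the assumption of one floating point operation per result are illustrative choices, not part of the original model description. 

```python
def vector_time(S: float, R: float, n: int) -> float:
    """Equation (1): time in microseconds for one vector operation."""
    return S + R * n


def half_performance_length(S: float, R: float) -> float:
    """Vector length at which half the asymptotic rate is reached (S/R)."""
    return S / R


def vector_mflops(S: float, R: float, n: int, flops_per_result: int = 1) -> float:
    """Achieved rate in MFLOPS (results per microsecond)."""
    return flops_per_result * n / vector_time(S, R, n)


def kernel_time(ops: list[tuple[float, float]], n: int) -> float:
    """Kernel estimate: the sum of the individual vector operation times."""
    return sum(vector_time(S, R, n) for S, R in ops)


add = (1.00, 0.010)        # S, R for floating point ADD (Table 1)
multiply = (1.04, 0.010)   # S, R for floating point MULTIPLY (Table 1)
print(half_performance_length(*add))     # vector length for half speed
print(vector_mflops(*add, 100))          # rate at that vector length
print(kernel_time([add, multiply], 1000))
```

At n equal to S/R the start-up time and the streaming time are equal, so the achieved rate is half of the asymptotic rate 1/R. 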


III. MODELING SCALAR PERFORMANCE OF THE CYBER 205 


Initially Equation (1) was also applied to 
scalar loops where n was the trip count of the 
loop. Over a hundred short test kernels were 
timed. Though the resulting timings were 
linear, it was found that R was not constant for 
a given instruction in different loop contexts. 
This should not be surprising. For example, if 
loads and stores for an add operation can be 
overlapped by a multiply operation in the same 
loop, the rate is clearly different than if no 
overlap can be scheduled. R for an add 
operation was found to range from 0.42 
microsec/result to 0.08 microsec/result 
depending on the context of the loop. Figure 2 
shows the dependence of R on the number of 
operations (I) in the loop. For adds and 
multiplies this can be fitted well by: 

    $R = 0.08 \exp(1.61/I)$    (4) 

    $t(\text{scalar loop}) = S + R\,n\,I$    (4a) 

For divide and square root, R is constant: 

    $R = 1.1$ microsec/result    (5) 


FIGURE 2. R in Scalar Loops 


More testing showed R to also be a function of 
whether the operands were in memory or in the 
scalar register file (256 registers). All 
dimensioned variables (indexed operands) are in 
memory whereas as many scalar variables as 
possible are in the register file, for which 
operations are significantly quicker. Thus 
Equation (4) was fitted to include register to 
register operations: 

    $R_{\text{scalar}} = \left(0.08 - 0.03\,\frac{J}{J+K}\right) \exp(1.61/I)$    (6) 

where J is the number of scalar references and K 
is the number of indexed references in the 
loop. (Note: $I \approx J+K$.) 
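
Equations (4a) and (6) can be evaluated with the sketch below. It covers only adds and multiplies, leaves out the correction for IF statements, and uses hypothetical function names; the start-up value in the example is the "usual" figure of about 0.25 microsec quoted in the text. 

```python
import math


def scalar_rate(I: int, J: int, K: int) -> float:
    """Equation (6): R in microsec/result for adds and multiplies.

    I: operations in the loop, J: scalar (register) references,
    K: indexed (memory) references.
    """
    if I <= 0 or J + K <= 0:
        raise ValueError("I and J + K must be positive")
    return (0.08 - 0.03 * (J / (J + K))) * math.exp(1.61 / I)


def scalar_loop_time(S: float, n: int, I: int, J: int, K: int) -> float:
    """Equation (4a): loop time in microseconds for trip count n."""
    return S + scalar_rate(I, J, K) * n * I


# A loop with one operation and only indexed operands.
print(scalar_rate(I=1, J=0, K=1))
print(scalar_loop_time(S=0.25, n=100, I=1, J=0, K=1))
```

With no register references (J = 0) the expression reduces to Equation (4), and for I = 1 it gives about 0.40 microsec/result, close to the upper end of the measured range for an add. 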


The behavior of S and R is surprisingly simple 
for scalar loops. S is always less than 1.0 
microsec, and usually about 0.25 microsec. Each 
IF statement in the loop (assuming no nested 
IFs) adds 0.7/I microsec/result to R and has no 
effect on S. For nested structures, like nested 
DO loops, simply evaluate the innermost loop 
first and then have it act as an in-line routine 
with its characteristic I, J, K and S added to 
the other statements in the next level of the 
nest structure. 


Initially it seemed unlikely that the scalar 
unit could be timed as simply as noted above. 
The FORTRAN compiler’s scalar optimizer and the 
hardware were essentially being treated as a 
black box. It is interesting and curious to 
suggest that it can be timed empirically without 
knowledge of its detailed operation. 
Investigators might want to try this for other 
machines. 


TABLE 1. INSTRUCTION TIMING MODEL FOR C205 VECTOR OPERATIONS 


Operation                  S (microsec)   R (microsec/result)   n[1/2] = S/R 

ASSIGNMENT                 1.00           0.010                 100 
ADD (Floating Pt.)         1.00           0.010                 100 
MULTIPLY (Floating Pt.)    1.04           0.010                 104 
DIVIDE (Floating Pt.)      1.60           0.031                 52 
ADD (Integer)              1.04           0.010                 104 
MULTIPLY (Integer)         3.72           0.040                 93 
DIVIDE (Integer)           2.50           0.041                 61 
Q8VINTL                    0.95           0.020                 48 
Q8VGATHR (Index List)      1.50           0.027 to 0.081        53 
COMPARE (Floating Pt.)     1.35           0.010                 135 
Q8SSUM                     2.50           0.020                 125 
Q8VSCATR (Index List)      1.38           0.025 to 0.081        55 
Q8SDOT                     2.65           0.020                 133 
Q8VMASK                    1.90           0.010                 190 
Q8VCMPRS                   1.38           0.010                 138 
Q8VEXPND                   1.45           0.010                 145 
Q8VGATHP (Periodic)        0.83           0.029                 29 
Q8VSCATP (Periodic)        1.50           0.024                 125 
Q8SMAX                     1.50           0.020                 75 
VSQRT                      1.55           0.030                 52 
LINK ([S(+)V](+)V)         2.84           0.010                 284 
LINK ([V(+)V](+)S)         2.60           0.010                 260 
VIFIX                      1.05           0.010                 105 
VFLOAT                     1.05           0.010                 105 
Q8VMKO (Z)                 1.04           0.0013                800 
BIT COMPARE                1.22           0.00125               976 
WHERE (Logical)            0.20           0.0                   # 
WHERE (Expression)         1.00           0.0                   # 
STACKLIB TRIADS            5. to 16.      0.144                 35 to 111 
STACKLIB DIADS             2. to 4.       0.113 to 0.128        18 to 35 
Q8VMERG                    1.50           0.010                 150 
VEXP                       24.            0.5 to 0.6            40 to 48 
(REAL)**REAL               24.            0.05 to 0.06          400 to 480 
VSIN                       19.            0.28 to 0.42          45 to 68 
Q8VCTRL                    1.10           0.010                 110 


Each result takes two (2) floating point operations. 

# The WHERE statement only has start-up time. The time per 
result is zero. If there is an expression in the WHERE 
statement its timing must be calculated separately and 
then the WHERE statement start-up added to it. 

(+) This signifies either the multiplication or addition 
operator in the Link instruction. In a given Link on the 
Cyber 205 there cannot be more than one add or multiply. 
There are other types of operators which can be Linked, 
but these were not timed. 


IV. PERFORMANCE MODEL VALIDATION 

A test base of 18 FORTRAN kernels from the 
Lawrence Livermore Laboratory [3] was used, just 
as in [1]. For each loop there were at least 
two different source code versions, scalar and 
vector, and for loops 2, 4, and 18 there were 
two or more vector versions. Kernels 16 and 17 
were discarded because the unknown data 
dependent branching probabilities made these 
loops impossible to time. Note that the same 
was not true of Loop 15 which has many IF 
statements in the original code. 


Using Equations (1), (4a), (5), and (6) and 
Table 1, the test base kernels were timed using 
the source code versions alone. Scalar times 
converted to MFLOPS are listed in Table 2. 
These are compared to the actual timings with 
the relative error noted. On the average the 
predicted performance is 15% too high. More 
interesting is the spread in the relative 
error. In statistics this is called the "sample 
variance" or "standard deviation," and is 
annotated as "sigma." For this sample the 
scalar timings have a sigma of about 31%. Thus, 
over a performance range of 1.7 to 22.4 MFLOPS, 
scalar timings can be predicted repeatedly to 
within ±31% using a very simple three parameter 
equation (S, J, and I). Figure 3 shows the 
distribution of errors in a histogram format. 


FIGURE 3. Histogram of Relative Error for 
Scalar Loop Timing Predictions (frequency of 
occurrence versus relative error) 


Vector timings, converted to MFLOPS, are listed 
in Table 3. The triple entries represent three 
different vector lengths where typically the 
second length is twice the first, and the third 
vector length is twice the second (see [1]). 
Note that performance is well predicted over a 
large span of vector lengths and over a 
performance range of 3.4 to 168 MFLOPS. Loops 
2a, 4, 8, 13, 14, 15, 18 are examples of kernels 
that are tricky timing exercises requiring index 
list gathers and scatters, partial 
vectorization, bit vector operations, WHERE 
blocks, multiple nested loops, and up to 73 
timing equations (Loop 15). 

Figure 4 shows in a histogram format the 
distribution of errors for this test base in 
vector mode. Loop 13 is a statistical "outlier" 
at more than the 3 sigma level, implying that it 
is not a statistical anomaly but a technical one 
not satisfied by the model. Therefore it should 
not be used in the sample for calculating the 
sample mean or variance. With that loop 
eliminated, Loop 14 is a statistical outlier at 
the 2 sigma level, which is suggestive but not a 
compelling reason to delete it from the sample. 
Ignoring Loop 13, the average predicted 
performance is about 5% too high. The spread in 
the relative error (sigma) is less than 10%. 


Loops 13 and 14 both involve random indexing
through memory. My timings for Q8VGATHR and
Q8VSCATR assumed a "well behaved list," and the
lists in these loops likely are not that well
behaved. More tests showed the performance could
degrade by a factor of 3 in R in the worst case.
Thus a range in R is shown in Table 1. These
ranges more than make up for the error in the
predictions of Loops 13 and 14. A good working
value of R for these two instructions is 0.035
microsec/result. Making this correction to the
performance predictions of these loops brings
these estimates into line with their actual
performance.


FIGURE 4. Histogram of Relative Errors for
Vector Code Timing Predictions (frequency of
occurrence versus relative error)


TABLE 2. C205 PERFORMANCE MODEL - Scalar Code

(The Predicted (MFLOPS) and Actual (MFLOPS)
columns are not recoverable in this copy; only
the Relative Error column survives, listed here
in printed row order.)

Relative Error
15.6%, -10.6%, 8.5%, 73.6%, 21.5%, 62.5%,
-11.2%, -32.1%, 4.6%, 36.0%, 47.0%, -13.8%,
51.6%, -14.5%, 11.8%, 12.0%

Mean ± sigma: 14.9% ± 31.3%


TABLE 3. C205 PERFORMANCE MODEL - Vector Code

(Kernel labels are not recoverable in this copy;
rows are in printed order. Each triple gives the
three vector lengths.)

Predicted (MFLOPS)     Actual (MFLOPS)        Relative Error
89.7, 116.6, 137.2     94.4, 122, 138.9       -3.5%
12.3, 16.9, 20.8       12.5, 16.2, 19.8       2.6%
65.4, 79.1, 88.3       64.6, 76.9, 87.2       1.8%
65.4, 79.1, 88.3       64.6, 76.9, 87.2       1.8%
13.5, 22.9, 30.0       (illegible)            4.8%
9.5, 16.1, 21.1        8.7, 15.3, 20.4        6.0%
5.8, 6.2, 6.5          5.1, 5.7, 5.9          10.9%
6.1, 6.5, 6.7          5.4, 6.1, 6.4          8.1%
93.6, 127.6, 155.8     113, 146, 168          -12.3%
3.2, 18.1, 18.9        3.4, 15.9, 16.6        7.3%
52.0, 79.6, 108.5      54.4, 81, 110          -2.5%
22.4, 30.5, 37.1       22.7, 30.0, 36.1       1.0%
7.6, 7.9, 8.1          8.0, 8.5, 8.6          -6.0%
71.4, 83.3, 90.9       62, 75, 83             11.9%
7.7, 8.8, 9.4          3.9, 4.4, 4.5          102%
6.2, 6.5, 6.6          5.0, 5.3, 5.2          24.5%
21.4, 29.2, 35.7       18.4, 25.6, 30.3       16.1%
44.1, 54.4, 61.6       42.3, 50.4, 55.6       7.7%
38.3, 47.4, 53.8       35.0, 42.4, 47.3       11.7%
4.2, 4.2, 4.2          4.2, 4.2, 4.2          0%

All 16 Kernels:               9.7% ± 23.2%
All 16 Kernels except #13:    4.8% ± 8.4%

Additional entries whose position in the table
cannot be recovered: ± 3.6%, ± 11.4%, ± 2.1%


V. RESULTS 


For 9 of the 16 kernels in this test base
(kernels 1, 2, 3, 7, 9, 10, 11, 12, and 15),
vector optimization clearly discriminates among
the possible source code choices. The
likelihood of an error being made in any of
these choices is quite small. The differences
in their respective scalar and vector
performance predictions far exceed the sum of
their expected errors, in all cases by more than
2 sigma and in most by more than 3 sigma. Of
these 9 kernels, kernels 2 and 15 were subtle
timing exercises (Section IV).


Making the proper choice of source code for
kernels 4, 5, 6, 8, 13, 14, and 18 is not so
straightforward. Figure 5 shows the probability
of making an error in code choice for a kernel
as a function of the separation of the
performance estimates. An error is made when
the code version with the faster time estimate
turns out to be the slower running version. In
Figure 5, M1 and M2 are the respective
performance estimates of two choices of source
code for the same kernel. Kernels 2a, 4ab, 5,
6, 8, 13, 14, and 18ab are noted on the figure as
examples. The average overestimation of vector
performance (5%) and scalar performance (15%)
has been factored out in these cases. The
standard errors, S1 and S2, for scalar and
vector estimates are 31% and 10% of their
respective estimates (Mv and Sv for the vector
estimate and Ms and Ss for the scalar one). Out
of these 8 examples, it is expected that one
(rounding to the nearest integer) will be
wrong. In fact, three of them are (kernels 6,
8, and 14). The chosen versions are 15%, 29%,
and 4% slower than the faster versions that were
not chosen. It should be noted that the scalar
speed of kernel 8 is unusually fast (rumor has
it that the compiler was tuned on this
particular piece of code).
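
The kind of calculation behind such a probability curve can be sketched as follows. This is only an illustration under stated assumptions: the two estimates are treated as having independent, normally distributed errors whose standard deviations are fixed fractions of the estimates (10% for vector, 31% for scalar, as quoted in the text). The function names are hypothetical, and since the exact construction of Figure 5 is not given here, the numbers produced need not match the figure.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_wrong_choice(m1: float, m2: float,
                      rel_sigma1: float, rel_sigma2: float) -> float:
    """Probability that the version with the higher estimate is really slower.

    Assumption (not from the paper): the true performances are independent
    normal variables centred on the estimates m1 and m2, with standard
    deviations rel_sigma1*m1 and rel_sigma2*m2 (both must be positive).
    """
    s1 = rel_sigma1 * m1
    s2 = rel_sigma2 * m2
    spread = math.hypot(s1, s2)          # std. dev. of the difference
    return normal_cdf(-abs(m1 - m2) / spread)


if __name__ == "__main__":
    # Two vector versions separated by 40% of their average estimate.
    print(prob_wrong_choice(1.2, 0.8, 0.10, 0.10))
    # Vector (first) versus scalar (second) at the same separation.
    print(prob_wrong_choice(1.2, 0.8, 0.10, 0.31))
```

Under these assumptions the error probability is 50% when the estimates coincide and falls as the separation grows, and it falls more slowly when one of the estimates is a scalar one with the larger sigma.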


Given an incorrect choice of the version of
source code, statistics predicts how much the
error is likely to cost in performance. Where
Figure 5 shows the probability of making an
error, Figure 6 shows the probable size of the
penalty as a function of the separation of the
estimates. This assumes that the error has been
made. Kernels 6, 8, and 14 are included as
examples. This "penalty" function is fairly
flat, with the average penalty for a large sample
of incorrect choices typically in the range of
5% to 20%. The top curve represents the 90%
confidence interval of the penalty. Therefore
the probability that the penalty lies below that
line is 90%.


VI. Conclusions

The analytical timing model presented in this
paper estimates the performance of scalar
kernels with a standard deviation of 30%. The
performance of vector kernels is more accurately
predicted, with a standard deviation of 10%. In
many cases these errors are insignificant
compared to the difference in predicted timings
for two coded versions (e.g., scalar versus
vector) of the same kernel. In such cases
vector optimization is eminently safe and
useful. One class of kernels that will
repeatedly fall into this category on the CYBER
205 consists of the kernels whose vector
performance is predicted to exceed 25 MFLOPS.
The test base showed some slower kernels for
which vector optimization also clearly
discriminated the faster version.


When vector optimization chooses between two
versions of a kernel whose performance
difference is comparable to the expected errors,
the probability of making an error in choice is
no longer negligible. When choosing between
scalar and vector code, the chance of error is
8% when the performance difference is 40% of the
average estimate. The chance of error grows to
42% when the performance difference is 10%.
When choosing between two vector versions, the
probability of error is 0.5% at a performance
difference of 40% of the average estimate, and
22% at a performance difference of 10%. When an
error is made, the typical performance penalty
incurred ranges from 5% to 20%. Rarely (less
than 10% of the time) would the penalty exceed
50%. These conclusions assume that the errors
in timing predictions obey normal distribution
statistics. This assumption looks correct for
the vector timing predictions, but is suspect
for the scalar ones. This code sample implies
that high performance scalar code is selectively
underestimated, while the slow performance
scalar code is selectively overestimated.
Perhaps the analytical timing model for scalar
performance needs a nonlinear term to better
match the high and low performance extremes. I
think this last point forces more interpretation
than is really in the data. The model needs to
be tested on more code, preferably real
applications, and I intend to do this in the
future.


The bottom line for this experiment is that out
of 16 kernels, 9 have clear vector optimization
choices from among two or more choices. Of the
remaining 7 harder-to-discriminate kernels, 3
are chosen incorrectly, for an average penalty of
17%. Weighting all 16 kernels equally, the mean
scalar performance is 8.1 MFLOPS, while the mean
vector performance is 32.1 MFLOPS (for the
shortest vector lengths), 41.6 MFLOPS (for the
middle vector lengths), and 48.3 MFLOPS (for the
longest vector lengths). Vector optimization
yields a mean performance of 37.7, 47.2, and 54.7
MFLOPS respectively for these vector lengths.
If scalar code is chosen in all cases except
where the best vector performance estimate
exceeds the scalar one by 40% (less than 8%
chance of slowing the code by choosing vector),
then such a vector optimization algorithm would
yield an average performance of 37.7, 48.0, and
55.1 MFLOPS. If the vector optimization decisions
had all been correct (that is, best effort), the
mean performance for this test base would have
been 37.7, 48.1, and 55.2 MFLOPS.
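
The conservative selection rule described above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function name and arguments are hypothetical, and the 40% margin is interpreted here (as an assumption) relative to the average of the scalar estimate and the best vector estimate.

```python
from typing import Sequence


def choose_version(scalar_est: float,
                   vector_ests: Sequence[float],
                   margin: float = 0.40) -> str:
    """Pick 'vector' only when the best vector estimate clearly wins.

    scalar_est and vector_ests are predicted MFLOPS values. The margin is
    taken relative to the mean of the two competing estimates, which is an
    assumption made for this sketch.
    """
    best_vector = max(vector_ests)
    mean_est = 0.5 * (best_vector + scalar_est)
    if (best_vector - scalar_est) / mean_est > margin:
        return "vector"
    return "scalar"


if __name__ == "__main__":
    print(choose_version(8.0, [32.0]))        # large separation
    print(choose_version(10.0, [9.0, 12.0]))  # small separation
```

Falling back to scalar code whenever the separation is small trades a little average performance for a lower chance of choosing a version that turns out slower.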


The vector optimization algorithm presented here
should be easy to implement in a compiler or
other automatic vectorization software.


Abstract -- A first-order recurrence is a
sequence of evaluations in which the value of the
latest term depends on the previously computed
term. Due to its sequential nature, it presents a
special problem for parallel processing. For most
scientific applications, only the final term is
desired. This paper presents various strategies
to evaluate the final value of a first-order
recurrence using a pipeline. Two methods, symmetric
reduction and asymmetric reduction, are proposed
and compared in a static pipeline environment.
The pipeline utilization can be further improved
when multiple recurrence systems are evaluated.


I. Introduction 


A first-order recurrence is a sequence of
evaluations in which the value of the latest term
in the sequence depends on the previously computed
term. For most scientific applications, only
the final term is desired, such as the inner
product of two vectors, which forms the basis for
most matrix manipulations. The inner product can
be evaluated as a linear first-order recurrence.
The evaluation of a recurrence presents a special
problem for a parallel processing system because
the definition itself is given in terms of
sequential evaluation [3].

A general class of first-order recurrence
equations can be expressed in the following form
[7]:

\[
\begin{aligned}
Z[1] &= B[1] \\
Z[i] &= h(B[i],\, g(A[i],\, Z[i-1])), \qquad 2 \le i \le N
\end{aligned}
\tag{1}
\]

where A[i] and B[i] are external inputs and h and
g are index-independent functions that satisfy
the following restrictions: 1) h is associative;
2) g distributes over h; and 3) g is
semiassociative. The external inputs are
essentially two vectors with N elements. In this
study, we are interested in the final scalar
result Z[N], which is reduced from the vector
inputs. Thus, the basic primitive operation which
requires sequential processing in the evaluation
of the first-order recurrence is vector
reduction. Vector reduction accepts vector input
and produces a single scalar output [5,8].
Perhaps the most common such vector reduction
operation is the vector summation, which takes
one input vector and produces a single output
equal to the sum of the elements of the input.
Other vector reduction operations can be found
in [12].
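
As an illustration of Eq.(1), the sketch below evaluates the recurrence sequentially and returns only the final term Z[N]. It is a minimal example, not code from the paper: the function name is hypothetical, Python lists are 0-based (so `a[0]` is unused, mirroring the fact that A[1] does not appear in Eq.(1)), and g and h are chosen as multiplication and addition, which satisfy the three restrictions listed above.

```python
import operator
from typing import Callable, Sequence


def first_order_recurrence(a: Sequence[float],
                           b: Sequence[float],
                           g: Callable[[float, float], float],
                           h: Callable[[float, float], float]) -> float:
    """Return Z[N] for Z[1]=B[1], Z[i]=h(B[i], g(A[i], Z[i-1]))."""
    z = b[0]                          # Z[1] = B[1]
    for i in range(1, len(b)):        # i = 2 .. N in the paper's numbering
        z = h(b[i], g(a[i], z))
    return z


if __name__ == "__main__":
    # Linear recurrence Z[i] = B[i] + A[i] * Z[i-1].
    print(first_order_recurrence([0, 2, 3], [1, 1, 1],
                                 operator.mul, operator.add))
    # With every A[i] = 1 the recurrence reduces to vector summation of B.
    print(first_order_recurrence([1, 1, 1, 1], [1, 2, 3, 4],
                                 operator.mul, operator.add))
```

The loop makes the sequential dependence explicit: each step needs the previous z before it can start, which is what makes a direct pipelined implementation difficult.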


In a sequential processor, the evaluation of
the first-order recurrence involves a DO loop
operating on one element from each vector, A[i]
and B[i], at a time as defined in Eq.(1). Figure
1(a) shows a sequential organization for the
evaluation of the first-order recurrence. Two
functional units, g and h, are connected serially,
with the output of h feeding the input of g. A
simplified diagram which combines the functions h
and g is shown in Fig. 1(b), where f=hg. We may
combine the external input sources, A[i] and
B[i], into one external input source X[i], as in
the further simplified diagram shown in Figure
1(c). This is the model that we will use in this
paper. f is also called a vector reduction
operator because it reduces the vector input to
a scalar output. Figure 1(d) shows the functional
organization for performing the inner product
S=A·B. The vector reduction operator, "+", is
indicated by the dashed box.


Fig. 1. Evaluation of first-order recurrence:
(a) the functional organization for performing
the first-order recurrence equation; (b) two
functional units h and g are combined into one
functional unit f; (c) a non-pipelined model of a
vector reduction unit; (d) the functional
organization for performing the vector inner
product S = Σ A[i]*B[i].


Parallelism and pipelining are two major
techniques for achieving high performance in
processor unit design. Evaluation of a recurrence
system in an array processor has been extensively
studied by many researchers [1,7]. An
interconnection network is required in an array
processor to provide the necessary routing paths
for the purpose of cyclic vector reduction. An
array processor which is capable of evaluating
multiple recurrence systems has been studied in
[6,10].

This paper deals with pipelined evaluation
of the first-order recurrence. Pipelining
generally takes the approach of splitting the
function to be performed into small pieces and
allocating separate hardware to each piece,
termed a segment. Usually, the rate of data
flow through the pipeline is independent of the
number of segments and dependent on the rate at
which new data may be fed into the pipeline. If a
function is capable of being partitioned into K
segments, then a pipeline designed to execute the
same function repeatedly can perform the function
K times faster, at most. This peak performance
can be achieved only if the input elements are
mutually independent and the input string is very
long. However, due to the recurrence relation,
peak performance may not be easily achieved.


The pipelined design of a vector reduction 
unit needs a feedback path from the output to the 
input of the pipeline. Here only the static pipe- 
line is considered. The existence of feedback 
implies a certain sequentialism to the function 
being evaluated. Since each output of the pipe- 
line depends on previous outputs, pipelining does 
not help in the direct implementation of Eq.(1). 
Improper or inefficient control of such feedback 
around a pipeline can destroy its efficiency and 
decrease its throughput. This paper studies the 
construction and scheduling of pipelined reduc- 
tion units while maintaining a high throughput 
rate. 

Vector summation or inner product instructions
have been implemented in many vector processors,
for example, IBM 3838, TI-ASC, STAR-100,
CRAY-1 [4] and ESL systolic processor [10]. This
paper provides a generalized model for the
pipelined evaluation of first-order recurrence
under different scheduling strategies. Both single
vector input and multiple vector inputs are
considered for an environment consisting of one
pipelined reduction unit.

In Section II, a traditional recursive reduction
method is briefly described. A symmetric
reduction method is then introduced. A faster
asymmetric reduction method is proposed in
Section III. The above methods are also compared
in terms of control complexity, processing speed,
and buffer requirements. Section IV discusses
the situation in which multiple vector inputs
request the same recurrence operation
in a single pipeline unit. An interleaved
scheduling of multiple vector inputs sharing a
single pipeline is proposed. Finally, an example
is demonstrated with mixed single and multiple
vector processing by chaining several pipeline
units.


II. Symmetric Single Vector Reduction


A first-order recurrence system has a single 
vector input. It is assumed that a static pipe- 
line with K segments is used in evaluating a 
vector input of N elements, where N and K are 
arbitrary integer values. Further, each memory or 
buffer can supply one element per pipeline cycle 
time. 

In order to evaluate the final scalar result,
a recursive reduction method has been proposed
by Kuck [9] with N being an integer power
of 2. The N elements are divided into two halves.
N/2 pairs of elements are processed in the first
iteration through the pipeline. The intermediate
result vector of N/2 elements is then again
divided into two halves, N/4 elements each, in the
second iteration. After $\log_2 N$ iterations, the
final scalar result is obtained.

A generalized procedure based on the recursive
reduction of a vector with an arbitrary
value of N was specified in [12]. In addition to
storing the input vector or the intermediate
result in the memory, two extra buffers are
needed to hold the divided subvectors. When N is
large, the buffers will significantly increase the
hardware cost and time delays. Let T(N,K) denote
the number of cycles required to evaluate an
input vector with N elements in a pipeline with K
segments. The following result was obtained [12].


Theorem 1:
The recursive reduction method requires
$T_{RR}(N,K)$ cycles, where

\[ T_{RR}(N,K) = (K-1)\lceil \log_2 N \rceil + 3(N-1) \tag{2} \]


The extra buffers can be eliminated by
using a feedback path from the output to the
input of the pipeline as shown in Fig. 2. B(t) is
the feedback input at the t-th pipeline cycle and
C is a constant. The feedback input will be
latched if e=1. B(t-j) indicates the feedback
input at the (t-j)-th cycle.


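Fig. 2. The hardware organization of a pipelined
processor with K segments (the figure includes a
table giving the input pair selected by the
control inputs c1 and c0, drawn from the constant
C, the external input X[i], and the latched
feedback input B(t-j))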


The reduction operator, f, is commutative.
For ease of explanation, we assume N>K. The N
elements can be partitioned into K groups:

\[ Z = Z[N] = f(P(1), P(2), \ldots, P(K)) \tag{3} \]

where

\[ P(i) = f(X[N-i+1],\, X[N-K-i+1],\, X[N-2K-i+1],\, \ldots), \qquad 1 \le i \le K \tag{4} \]


If (N mod K)=0 then P(i) is the reduction
result of N/K elements for all i; otherwise, P(i)
is the reduction result of $\lceil N/K \rceil$
elements for 1≤i≤(N mod K) and of
$\lfloor N/K \rfloor$ elements for (N mod K)<i≤K.
T(N,K) can be divided into four phases:

\[ T(N,K) = T_1(N,K) + T_2(N,K) + T_3(N,K) + T_4(N,K) \tag{5} \]

where $T_1$ is the time needed to fill up the
pipeline, $T_2$ is the time needed to partition
the N elements into K groups, $T_3$ is the time
required to merge K groups into one group, and
$T_4$ is the time required to drain the pipeline.
Since the input elements X[i] are supplied at
the rate of one element per cycle, the following
results are obtained:

\[ T_1(N,K) = \min\{N,K\} \tag{6} \]

\[ T_2(N,K) = \max\{0,\,N-K\} \tag{7} \]

In total, $T_1$ and $T_2$ add up to N cycles. The
input elements may come from local memory or
from the output of another pipeline. The control
inputs will be (c1,c0,e)=(0,0,0) and (c1,c0,e)=
(1,0,0) for the phases of Eqs.(6) and (7),
respectively. The constant input C is chosen so
as to make f(C,X)=X. In the case of vector
summation, C will be 0. After N cycles, the
external input, X, to the pipeline will be set to
the same value as C.
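
The grouping of Eqs.(3) and (4) can be checked with a short sketch. This is an illustration only, with hypothetical function names: it builds the groups P(1), ..., P(K) by index arithmetic rather than modelling the pipeline hardware, and it relies on f being associative and commutative, as the text assumes.

```python
from functools import reduce
from typing import Callable, List, Sequence


def partition_groups(x: Sequence[float], k: int) -> List[List[float]]:
    """Return the element lists of P(1)..P(K) as defined in Eq.(4).

    Group i holds X[N-i+1], X[N-K-i+1], X[N-2K-i+1], ... in the paper's
    1-based numbering; x itself is an ordinary 0-based Python sequence.
    """
    n = len(x)
    groups = []
    for i in range(1, k + 1):
        indices = range(n - i + 1, 0, -k)      # 1-based indices, descending
        groups.append([x[j - 1] for j in indices])
    return groups


def reduce_by_groups(x: Sequence[float], k: int,
                     f: Callable[[float, float], float],
                     identity: float) -> float:
    """Reduce each group, then merge the K partial results (Eq.(3))."""
    partials = [reduce(f, grp, identity) for grp in partition_groups(x, k)]
    return reduce(f, partials, identity)


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7]               # N = 7
    print(partition_groups(data, 3))           # K = 3
    print(reduce_by_groups(data, 3, lambda a, b: a + b, 0))
```

For N=7 and K=3, the first (N mod K)=1 group receives $\lceil N/K \rceil = 3$ elements and the remaining groups receive $\lfloor N/K \rfloor = 2$, in agreement with the group sizes stated above.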

A pipeline segment is productive for a
given clock period if the segment is actively
involved in the computation during that period;
otherwise, the segment is called unproductive.
This implies that the result generated by an
unproductive segment will be ignored. After N
cycles (if N>K), all K segments are productive
and the i-th segment will be operating on P(i).
The merging of K groups is done by combining two
groups at a time. The number of groups is reduced
by half after each iteration. Thus,
$\lceil \log_2 K \rceil$ iterations will result in
a single group. Two methods are proposed for the
group-merging phase. The symmetric reduction
method is described first. The asymmetric
reduction method will be discussed in the
subsequent section.


Once all groups are merged into one group, K
cycles are needed to drain the pipeline in the
last phase. The drain delay is equal to

\[ T_4(N,K) = K \tag{8} \]

In the symmetric reduction (SR) method, two
consecutive productive segments are merged at one
time, as depicted in Fig. 3. Figure 3 shows the
pattern of productive segments in each iteration
for a pipeline of 10 segments, where $K_i$ denotes
the number of productive segments after the i-th
iteration and $K_0=K$. When $K_i$ is odd, the group
in the first segment will be merged with a dummy
group. Note that the group in the first segment
always includes the original group P(1) at the
beginning and end of each iteration. Thus, the
time required for each iteration is the time
needed to route the group in the first segment,
possibly merged with another group, back to the
first segment. The distance between two
consecutive productive segments is always $2^i$
after the i-th iteration. The control sequence is
illustrated below for merging K' groups in a
pipeline with K'≤K. The function E(x) is defined
to be 1 if the integer x is odd and 0 if x is
even.


Fig. 3. The pattern of productive segments before
and after each iteration for the symmetric
reduction in a pipeline with K=10 segments. The
number of productive segments is $K_0=10$
initially, $K_1=5$ after the first iteration,
$K_2=3$ after the second, $K_3=2$ after the third,
and $K_4=1$ after the fourth.


ALGORITHM-SGM: Symmetric Group-Merging


Input: K' productive segments (K'≤K).
Output: (1) The control sequence (c1,c0,e), and
(2) The group-merging time $T_{3(SR)}(K')$ of
merging K' groups into one group.


Procedure:


Begin
  t=0;
  for i=1 to ceil(log2 K') do begin
    if K'<K then {for t=t+1 to t=t+K-K', set
      (c1,c0,e)=(0,0,0)};
    if (K'-1) mod 2^(i-1) > 0 then {for t=t+1 to
      t=t+(K'-1) mod 2^(i-1), set (c1,c0,e)=(0,0,0)};
    G=0; (* G is a boolean variable *)
    D=E(ceil(K'/2^(i-1))); (* D is a boolean variable *)
    for j=1 to ceil(K'/2^(i-1)) - D do begin
      for t=t+1, set (c1,c0,e)=(G,G,G');
      if i>1 then {for t=t+1 to t=t+2^(i-1)-1,
        set (c1,c0,e)=(0,0,0)};
      G=G';
    end; (* end for *)
    for t=t+1 do (c1,c0,e)=(D,1,0);
  end; (* end for *)
End.


If N<K, then K'=N; otherwise, K'=K. The
above algorithm can be applied to an arbitrary
value of K'. The number of cycles required for
each iteration will be the time required to go
through the whole pipeline once, i.e., K cycles,
plus a certain number of delay cycles in the
latch if the number of productive segments is
odd. Table 1(a) shows the contents of the
pipeline segments and the latch in each cycle
during the merging of six groups using the SR
method. The following lemmas are used in
evaluating the time required in the
group-merging phase. Lemma 1 can be proved by
induction.


Lemma 1:
The number of productive segments after the
i-th iteration equals

\[ K_i = \lceil K / 2^i \rceil \tag{9} \]


Lemma 2:
If K is not an integer power of 2, then

\[ \sum_{i=0}^{\lceil \log_2 K \rceil - 1} 2^i E(K_i) = 2^{\lceil \log_2 K \rceil} - K \tag{10} \]

Proof: Eq.(10) can be proved by induction.
However, the following equation must be proved
first.

\[ \sum_{i=0}^{\lceil \log_2 K \rceil - 1} 2^i E(K_i) \;-\; \sum_{i=0}^{\lceil \log_2 (K+1) \rceil - 1} 2^i E((K+1)_i) = 1 \tag{11} \]

Since K is not an integer power of 2, K can
be written as $K=2^m d$, where m≥0 and d is an odd
number. Thus, $E(K_i)=0$ for i<m and $E(K_m)=1$.
Then we want to show that $E((K+1)_i)=1$ for i<m
and $E((K+1)_m)=0$.

If K is odd, then m=0 and K+1 is even. Thus,
$E(K_0)=1$ and $E((K+1)_0)=0$.

If K is even, then K+1 is odd and
$(K+1)_0=2^m d+1$. $(K+1)_i$ can be expressed as

\[ (K+1)_i = \lceil 2^{m-i} d + 2^{-i} \rceil = 2^{m-i} d + 1 \qquad \text{for } i<m \]

Thus, $E((K+1)_i)=1$ for i<m. Since d is odd and
$(K+1)_m=\lceil d+2^{-m} \rceil=d+1$, we have
$E((K+1)_m)=0$ and
$(K+1)_{m+1}=\lceil (d+1)/2 \rceil=\lceil d/2 \rceil=K_{m+1}$.
From Lemma 1, we have $K_i=(K+1)_i$ for i>m.

Since K is not an integer power of 2, this
implies that $\lceil \log_2 K \rceil=\lceil \log_2 (K+1) \rceil$.
Eq.(11) thus can be proved. By induction, Eq.(10)
can be easily derived.


Theorem 2:
In the group-merging phase of the symmetric
reduction method, the total time delay equals

\[
T_{3(SR)} =
\begin{cases}
g(K) & \text{if } N \ge K \\
g(N) + (K-N)\lceil \log_2 N \rceil & \text{if } N < K
\end{cases}
\tag{12}
\]

where $g(M)=M\lceil \log_2 M \rceil+2^{\lceil \log_2 M \rceil}-M$.

Proof of Theorem 2 follows directly from
Lemma 2. Note that if $K_{i-1}$ is even then K
cycles are needed in the i-th iteration;
otherwise, it takes an extra $2^{i-1}$ cycles to
merge with a dummy segment as stated in the
Algorithm. If N<K, we may consider a pipeline
having N segments and K-N dummy segments. Thus,
each iteration will require (K-N) extra cycles due
to the delay in dummy segments. Since it takes
$\lceil \log_2 N \rceil$ iterations, Eq.(12) is
achieved.


III. Asymmetric Single Vector Reduction 


The symmetric reduction method is good for
microprogrammed control because the generation of
the control sequence has a regular pattern.
However, it takes more cycles, as shown in Table
1(a). In the second iteration, four cycles later,
three groups have been merged into two groups. But
it takes four more cycles to bring P(1) back to
the first pipeline segment. The asymmetric
reduction (AR) method can eliminate such
unnecessary delays, as illustrated in Table 1(b).
However, the control sequence no longer has a
regular pattern, because it is very difficult to
express the distance between productive segments
mathematically. A hardwired control can
nevertheless be easily implemented to generate the
control sequence.


With the AR method, the pipeline processor
needs to record the state of each segment as well
as the latch. Denote the state of the latch as
$S_0$ and the states of the pipeline segments as
$S_1, S_2, \ldots, S_K$. $S_i=1$ indicates that the
i-th segment is productive; otherwise, it is 0. The
state of the pipeline is expressed by a (K+1)-tuple
$(S_0,S_1,\ldots,S_K)$. Initially, we have $S_0=0$
and $S_1=S_2=\cdots=S_K=1$ if K≤N, and
$S_1=S_2=\cdots=S_N=1$ and $S_{N+1}=\cdots=S_K=0$ if
K>N. The group-merging operation is terminated if
$S_1=1$ and $S_2=\cdots=S_K=0$.


Table 1. Contents of pipeline segments (K=6) in
each cycle during the group-merging phase:
(a) Symmetric Reduction (SR) method;
(b) Asymmetric Reduction (AR) method


The control signals are primarily determined by
the current states of $S_0$, $S_1$, and $S_K$. The
latch is enabled if $S_0=0$ and $S_K=1$. If
the latch is occupied and the last segment is
productive, these two groups will be merged in
the next cycle. The control outputs can be
expressed as follows:

\[
\begin{aligned}
c_1 = c_0 &= S_0 S_K \\
e &= \bar{S}_0 S_K
\end{aligned}
\tag{13}
\]

A state transition table can be easily
derived. The first segment becomes productive if
two groups from the latch and the last segment
are merged. The next state (primed) is expressed
by the present state as follows:

\[
\begin{aligned}
S_0' &= S_0 \oplus S_K \\
S_1' &= S_0 S_K \\
S_i' &= S_{i-1}, \qquad 2 \le i \le K
\end{aligned}
\tag{14}
\]

The above equations can be easily modified
to cover the three other phases. Theorem 3 states
the total number of cycles required for the
group-merging phase in the AR method.


Theorem 3:
In the group-merging phase of the asymmetric
reduction method, the total time delay equals

\[
T_{3(AR)} =
\begin{cases}
h(K) & \text{if } N \ge K \\
h(N) + (K-N)\lceil \log_2 N \rceil & \text{if } N < K
\end{cases}
\tag{15}
\]

where $h(M)=M\lceil \log_2 M \rceil-2^{\lceil \log_2 M \rceil}+M$.

Proof: Using the AR method, the number of
productive segments in each iteration is the same
as with the SR method. However, the distance
between any two consecutive productive segments is
no longer a constant. But the distance between the
first and the second occupied segment is still
$2^{i-1}$ before the i-th iteration. If the number
of productive segments, $K_{i-1}$, is odd before
the i-th iteration, the last merge will be on the
2nd and the 3rd productive segments. The first
productive segment does not have to go through the
pipeline. Thus, if $K_{i-1}$ is odd, it takes
$K-2^{i-1}$ cycles for the i-th iteration;
otherwise, it takes K cycles. Eq.(15) thus can be
achieved by applying Lemma 2.

Q.E.D.
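
The cycle counts of Theorems 2 and 3 can be cross-checked with a small state simulation of the AR group-merging phase for the case N≥K. This is a sketch under stated assumptions, not the paper's control logic: function names are hypothetical, and the latch update is taken to be an exclusive-or of the latch state and the last-segment state (the latch fills when it is empty and a group arrives, empties on a merge, and otherwise holds its contents), which is one reading of the next-state equations.

```python
def ceil_log2(m: int) -> int:
    """ceil(log2(m)) for an integer m >= 1."""
    return (m - 1).bit_length()


def g(m: int) -> int:
    """Closed form for SR group-merging cycles (Theorem 2, N >= K)."""
    return m * ceil_log2(m) + 2 ** ceil_log2(m) - m


def h(m: int) -> int:
    """Closed form for AR group-merging cycles (Theorem 3, N >= K)."""
    return m * ceil_log2(m) - 2 ** ceil_log2(m) + m


def simulate_ar(k: int) -> int:
    """Count cycles until one group remains in the first segment.

    State: latch (S_0) and seg[0..k-1] (S_1..S_K), all initially productive.
    Assumed transition: S_0' = S_0 xor S_K, S_1' = S_0 and S_K, and every
    other segment takes the state of its predecessor.
    """
    latch, seg, cycles = 0, [1] * k, 0
    while not (latch == 0 and seg[0] == 1 and sum(seg) == 1):
        last = seg[-1]
        merged = latch & last        # latch and last segment are merged
        latch ^= last                # fill, empty, or hold the latch
        seg = [merged] + seg[:-1]    # shift the pipeline by one segment
        cycles += 1
        if cycles > 4 * k * k:       # safety guard for this sketch
            raise RuntimeError("simulation did not terminate")
    return cycles


if __name__ == "__main__":
    for k in range(1, 10):
        print(k, g(k), h(k), simulate_ar(k))
```

For K=6 this gives h(6)=16 and g(6)=20, consistent with the statement that the SR method spends four extra cycles in its second iteration.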


Table 2 shows the total number of cycles
required for group-merging for K=1 to 16 and for
the SR and the AR methods, respectively. It also
shows the number of cycles required and the
number of productive segments in each iteration.
Note that the number of cycles required for the
SR method is at least K for each iteration,
whereas it is at most K for the AR method.
Precisely, the SR method requires
$2(2^{\lceil \log_2 K \rceil}-K)$ more cycles than
the AR method, if N≥K. If $K=2^k$ for some k, the
total processing time will be the same for both
methods. In the worst case of $K=2^k+1$, the SR
method requires 2(K-2) more cycles than the AR
method. The number of pipeline segments, K, in a
typical vector processor is in the range (2,15)
[2]. Thus, when N is much greater than K, the
difference in processing time between the SR
method and the AR method is at most 2(K-2), which
is insignificant. Table 3 lists the total vector
reduction time for some typical values of N and K
under three different reduction methods. In
general, if N is much greater than K, the saving
in vector reduction time in these two proposed
methods is $2N+O(K(\log_2 N-\log_2 K))$ cycles
over the recursive reduction method. Furthermore,
the extra buffer space is eliminated.


Table 2. Number of cycles required in each
iteration of the group-merging phase for
different numbers of pipeline segments, with
totals g(K) for SR and h(K) for AR (table body
not recoverable in this copy)

PS: Productive segments after the i-th iteration
SR: Symmetric reduction method
AR: Asymmetric reduction method


Table 3. Comparison of three different methods
in terms of the total processing time
for some typical values of N and K, up to K=15
(table body not recoverable in this copy)

RR: Recursive Reduction
SR: Symmetric Reduction
AR: Asymmetric Reduction


IV. Interleaved Multiple Vector Reduction 


A set of independent recurrence systems with
the same function may all need their final values
evaluated. For example, a matrix-vector
multiplication involves many independent vector
inner product operations. The scheduling of
multiple vector inputs in a single pipeline
processor is studied below. Assuming M independent
vectors of N elements each, the scalar result for
the j-th vector is defined by

\[ Y[j] = f(X[j,1],\, X[j,2],\, \ldots,\, X[j,N]), \qquad 1 \le j \le M \tag{16} \]

The memory is assumed to supply one element
per pipeline cycle. The simplest approach to
evaluating Eq.(16) is to process one vector at a
time. The total processing time will be M·T(N,K).
With careful control, draining the pipeline of
the last vector input can overlap with filling up
the pipeline with the next vector input. Thus, the
total processing time can be reduced to
M·T(N,K)-K(M-1). Even with this overlapping of
operations between vectors, the pipeline is still
not fully utilized during the group-merging phase.
A more efficient scheduling approach is to
interleave the processing of multiple vector
inputs.

Consider the case of M=K first. For
interleaved processing, the memory will supply
X[1,1], X[2,1], ..., X[M,1] for the first M
cycles, X[1,2], X[2,2], ..., X[M,2] for the next M
cycles, and so on. After the first M cycles, the
i-th pipeline segment will be operating on
X[M+1-i,1] for 1≤i≤K. Another M cycles later, the
i-th pipeline segment will be operating on
X[M+1-i,1] and X[M+1-i,2]. After M·N cycles, the
i-th pipeline segment will be operating on
X[M+1-i,1], X[M+1-i,2], ..., and X[M+1-i,N]. At
the end, K=M more cycles are needed to drain the
pipeline. The total processing time will be
K(N+1).
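
The interleaved schedule for M=K can be sketched as a cycle-by-cycle simulation. This is a simplified model rather than the hardware of the paper: the function name is hypothetical, the pipeline is represented as a list of K partial results that shifts once per cycle, and the combination with the fed-back value is modelled as happening when an element enters the first segment.

```python
from typing import Callable, List, Sequence, Tuple


def interleaved_reduce(vectors: Sequence[Sequence[float]],
                       f: Callable[[float, float], float],
                       identity: float) -> Tuple[List[float], int]:
    """Reduce M vectors of N elements in a K=M segment pipeline.

    Returns the M scalar results and the number of cycles used.
    identity must satisfy f(identity, x) == x (the constant C in the text).
    """
    m, n = len(vectors), len(vectors[0])
    k = m                               # one segment per vector (M = K)
    pipe = [identity] * k               # pipe[-1] is the last segment
    results: List[float] = []
    cycles = 0
    for t in range(m * n + k):
        out = pipe[-1]                  # value leaving the last segment
        if t < m * n:
            j, i = t % m, t // m        # vector j supplies its i-th element
            entering = f(out, vectors[j][i])
        else:
            results.append(out)         # drain phase: collect Y[j]
            entering = identity
        pipe = [entering] + pipe[:-1]   # advance the pipeline by one segment
        cycles += 1
    return results, cycles


if __name__ == "__main__":
    vecs = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(interleaved_reduce(vecs, lambda a, b: a + b, 0))
```

In this model each partial result returns to the pipeline input exactly when the next element of the same vector arrives, so no group-merging phase is needed and the cycle count is M·N for input plus K for draining.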

Obviously, this approach is considerably
faster than processing each vector sequentially
because no merging phase is involved. Pipeline
segments are unproductive only in the initial
filling-up phase and in the last draining phase.
The idea of interleaved processing is to allow
all pipeline segments to be shared by as many
vectors as possible. The system can be viewed as
M virtual nonpipelined processors of the kind
shown in Fig. 1(c). Each virtual processor is
dedicated to one vector input and receives one
input element per K cycles.


If M>K, not all M vectors can be allocated with pipeline
segments. In order to save the intermediate results of the
remaining M-K vectors, M-K dummy segments are introduced as
shown in Fig. 4. Since the number of input vectors may vary, the
length of the dummy segment buffer, D, is adjustable by program
control. There are virtually M nonpipelined processors. Each
virtual processor has a vector input with one element per M
cycles. Physically, each vector input is assigned one pipeline
segment. With D dummy segments inserted, the pipeline can be
viewed as an M-segment pipe. The total processing time will be
M(N+1).

If M<K, some vectors may be allocated with more than one
segment, and different vectors may have a different number of
segments. In this case, the control of the pipeline will be very
difficult because the procedure of merging groups for each
vector will be input-dependent. Since K is not always an integer
multiple of M, one way to allocate vectors with an equal number
of segments is to leave some segments idle, which results in low
pipeline utilization. With the help of a dummy segment buffer,
the pipeline can be fully utilized by choosing D=⌈K/M⌉M-K.

Let Q=⌈K/M⌉=(D+K)/M be the number of segments allocated to each
vector including dummy segments. By feeding the elements into
the pipeline in an interleaved fashion as before, MN cycles
later, all D+K segments will be occupied by groups of vectors.
Each vector has Q groups in Q segments. Since Q>1, a
group-merging phase is required. In Fig. 2, a latch was used to
hold the group to be merged with the next group. For multiple
vector inputs, each vector needs its own latch. A FIFO latch
buffer is provided in Fig. 4 for this purpose. Again, the system
can be considered to have M virtual processors. Each processor
is a pipeline organization with Q segments and each segment
takes M cycles.
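
The choice of the dummy segment count can be illustrated numerically. This is a minimal sketch, assuming that the single expression D=⌈K/M⌉M-K is used for all three cases (for M>K it reduces to M-K, and for M=K to zero); the function name `dummy_segments` is hypothetical.

```python
def dummy_segments(K, M):
    """Return (D, Q) for a K-segment pipeline shared by M vectors.

    Q = ceil(K / M) is the number of segments per vector (dummy
    segments included) and D = Q*M - K is the dummy buffer length.
    """
    Q = -(-K // M)      # integer ceiling of K / M
    D = Q * M - K
    return D, Q
```

For example, K=8 and M=3 give Q=3 and D=1, so that D+K=9 segments are shared evenly by the three vectors.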

Either the SR method or the AR method can be used during the
group-merging phase. However, the control sequence needs to be
modified. The control sequence for each vector is the same as
that of single vector reduction. The number of iterations will
be ⌈log2 Q⌉ during the group-merging phase. Since M vectors are
processed in an interleaved fashion, each control output must be
repeated M times, one for each input vector. The size of the
FIFO buffer is obviously chosen to be K-1 because M<K. The total
processing time in 
this case will be MeN+Q-T_(N, Q)*K. The quantity 


T iN» Q) depends on the reduction method to be 


used. This was evaluated in Theorem 2 and Theorem 
3 for the SR and the AR methods, respectively. 


Fig. 4. The hardware organization of a pipelined
processor for interleaved multiple vector
reduction.


If the number of input vectors is too large, 
the dummy segment buffer may not be able to 
provide enough dummy segments. In this case, the 
vector inputs must be partitioned such that one 
block of input vectors is processed at a time 
according to the above procedures. In multiple 
vector processing, the low pipeline utilization 
due to group-merging can be essentially eliminat- 
ed. The following example may further demons- 
trate the usage of a pipeline processor with a 
direct feedback path. 


Example:

Let A be an NxN matrix and B be an Nx1 vector. C is an Nx1
vector obtained by performing matrix-vector multiplication of A
and B. Given A and B, we want to find the maximum element in
vector C.

Three pipeline processors - multiplier, adder, and comparator -
are chained together as shown in Fig. 5 with K_m, K_a, and K_c
segments, respectively. For simplicity, N is assumed to be equal
to nK_a. A and B are stored in two independent memory modules.
The A matrix is partitioned into n blocks. Each block is a K_a
by N submatrix.

The pipeline adder is needed to perform multiple vector
reduction of N input vectors and to
result in N elements of the C vector, whereas the 
pipeline comparator is used to perform single vec- 
tor reduction and to produce the desired scalar 
output. To allow for overlapping of two blocks of 
data in a single pipeline processor, the feedback 
input either comes from the output of the pipe- 
line or is supplied with a dummy input. The 
Switching in the output of the pipeline can be 
easily controlled by a counter of value K_N. 

In the first N*N cycles, two input elements, 
one from the A matrix and one from the B vector, 
will be fetched from the memory modules and fed 
into the pipeline multiplier. At the c-th cycle, 


L<c<n®, elements aj and by are fetched, 
where O<i,j,k<N-1 and 
TK LCc-1)/K NJ +LC(c-1)mod (KN) ) /NJ 
J=((c-1)mod (Kk .N)) mod N ere 


k=[((c-1)mod(K .N))/K. | 
The input pattern of matrix A and vector B is also shown in
Fig. 5. The matrix A is fetched block by block. Each block is
fetched in column-major order and takes K_a·N cycles.
Corresponding to the input of one block of matrix A, the vector
B will provide N elements and each element will be repeatedly
used for K_a cycles.

The external input to the adder comes from the output of the
multiplier. The first input to the adder occurs when c=K_m+1.
The adder will produce K_a outputs, which are inputs of the
comparator, every K_a·N cycles. The comparator will be fed with
K_a consecutive elements from 
the output of the pipeline adder and will be fed 


Fig. 5. Three pipeline units and two memory
modules are organized to perform the
operation described in the example.


with dummy input for the next K_a(N-1) cycles. This process will
be repeated n times. The C vector will be merged into K_c groups
after K_m+N²+K_a cycles. If the AR method is used, h(K_c) cycles
are needed to merge these K_c groups into one group. K_c more
cycles are needed to drain the pipeline. In total,
K_m+N²+K_a+h(K_c)+K_c cycles are required to obtain the final
result. When N is very large, the total processing time is
dominated by the one-pass fetch of the A matrix.


By partitioning the matrix inputs, schedul- 
ing the input elements, and employing more 
pipeline units, the above method can be extended 


to evaluate many other matrix operations. This 
method has been successfully applied in designing 
a VLSI systolic architecture for the purpose of 
pattern clustering [13]. 


V. Conclusion 


New scheduling methods which can efficiently 
evaluate the first-order recurrence system in a 
pipeline processor have been demonstrated. It has 
been shown that the asymmetric reduction method 
is faster while the symmetric reduction method is 
good for microprogrammed control. When the number 
of segments is an integer power of two, the 
symmetric reduction method behaves the same as 
the asymmetric reduction method. If the input vector is very
long, the difference between their processing times is not
significant. Both of these methods are better than the
conventional recursive reduction method for reducing processing
time and eliminating the temporary buffer.

To evaluate multiple first-order recurrence 
systems in a single pipeline, the pipeline utili- 
zation can be further increased by interleaving 
multiple vector inputs. A physical pipeline 


shared by many vector inputs can be viewed as 
having many virtual reduction processors, in 
which each virtual processor is dedicated to one 
vector reduction operation. The pipeline utili- 
zation is further increased by totally or partial- 
ly eliminating the group-merging phase. 

Several pipeline units can be chained toge- 
ther and the vector inputs can be partitioned to 
facilitate interleaved processing as indicated in 
the example shown in Section IV. The design is 
actually a two-level pipelined architecture. Due 
to the feedback loop involved, scheduling of the 
external inputs and connecting of different pipe- 
line units must be carefully considered to avoid 
conflict. A system including multiple pipeline 
units and parallel memory modules can provide a 
more efficient systolic architecture for evalua- 
ting various matrix manipulations and is suitable 
for VLSI implementation.


THE SOLUTION OF LINEAR RECURRENCE RELATIONS ON PIPELINED PROCESSORS 


W.Oed 
Zentralinstitut fuer Angewandte Mathematik (ZAM) 
Kernforschungsanlage Juelich GmbH 
5160 Juelich - West Germany 


Abstract Recurrence relations are frequently to 
be solved iteratively on a computer. Since the 
computation of the i-th value usually depends on 
the (i-1)-th value, pipelined processors cannot be 
used to their full potential. A transformation 
method for obtaining an equivalent recurrence rel- 
ation not depending on the previous value together 
with some stability considerations concerning 
equivalent transformations are presented. 


Introduction 


Various aspects of the efficient solution of 
linear recurrence relations on parallel or pipe- 
lined processors have been investigated; 
e.g. [1]-[4]. We focus on the solution of linear 
recurrence relations on architectures where float- 
ing-point addition and floating-point multiplica- 
tion are performed by pipelined functional units. 

Our aim is to speed up recurrence relations on pipelined
processors, where the speedup S is defined as the ratio of the
straightforward, in a sense 'sequential', output latency $l_s$
over any optimized output latency $l_o$:

$$S = l_s / l_o \qquad (1)$$


For a pipelined processor the stage utilization indicates for
each stage how often on the average that stage processes data
within a regarded interval [6]. The length of the interval is
the latency $l_s$ for the sequential case, and $l_o$ for the
optimized case, respectively. The stage utilization for the j-th
stage in the sequential case is given by

$$U_{sj} = \frac{1}{l_s} \sum_{i=1}^{l_s} q_{ij} \qquad (2)$$

with $q_{ij}=0$ if stage j is idle, $q_{ij}=1$ if stage j is
busy at cycle i. If $U_{sm}$ is the maximum value of $U_{sj}$,
j=1,2,...,k, then the optimized output latency will be

$$l_o = l_s U_{sm} . \qquad (3)$$


Linear Recurrence Relations and Pipelining 


First we will investigate an m-th order linear homogeneous
recurrence relation with constant coefficients of the type

$$u_i - a_{m-1}u_{i-1} - a_{m-2}u_{i-2} - \cdots - a_0 u_{i-m} = 0 \qquad (4)$$

with given initial values $u_0,u_1,\ldots,u_{m-1}$ and real
coefficients $a_{m-1},a_{m-2},\ldots,a_0$.

Consider a second order recurrence relation with given initial
values $u_0$ and $u_1$

$$u_i = a_1 u_{i-1} + a_0 u_{i-2} . \qquad (5)$$

As an example, a computer is taken with one pipelined
floating-point multiplier and one pipelined floating-point
adder. Let $k_{MUL}$ be the number of stages for the multiplier,
and $k_{ADD}$ the number of stages for the adder. Figure 1
depicts in a reservation table the timing for this recurrence,
with $k_{MUL}=3$ and $k_{ADD}=2$ (these are for instance the
characteristics of the FPS-164 processor). As can be seen, every
k time-units with

$$k = k_{MUL} + k_{ADD} = l_s$$

a new $u_i$ is computed, once the computation has been set up.
Note that some overlap is taking place even in the 'sequential'
case, since $a_0 u_{i-2}$ can be computed already before
$u_{i-1}$ becomes available. However, with $U_{sm}=2/5$, we do
not have an optimal computation.
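
The value U_sm=2/5 can be reproduced from eq.(2) and eq.(3) with a small reservation table. The sketch below is an illustration only: the exact cycles at which the two multiplications and the addition enter their units are assumptions (the printed figure 1 is not reproduced here), but the utilization counts do not depend on that placement. All names are hypothetical.

```python
from fractions import Fraction

K_MUL, K_ADD = 3, 2          # stage counts quoted in the text
L_S = K_MUL + K_ADD          # sequential output latency l_s = k = 5


def reservation_table():
    """One steady-state interval: rows are stages, columns are cycles."""
    table = [[0] * L_S for _ in range(K_MUL + K_ADD)]
    # Assumed timing: the two products enter the multiplier at cycles 0 and 1.
    for start in (0, 1):
        for stage in range(K_MUL):
            table[stage][(start + stage) % L_S] = 1
    # Assumed timing: the single addition follows the second product.
    add_start = 1 + K_MUL
    for stage in range(K_ADD):
        table[K_MUL + stage][(add_start + stage) % L_S] = 1
    return table


def stage_utilization(table):
    """U_sj = (1 / l_s) * sum_i q_ij for every stage j, as in eq.(2)."""
    l_s = len(table[0])
    return [Fraction(sum(row), l_s) for row in table]


utilization = stage_utilization(reservation_table())
u_sm = max(utilization)      # maximum stage utilization
l_o = L_S * u_sm             # optimized output latency, eq.(3)
```

Each multiplier stage is busy in two of the five cycles and each adder stage in one, so the multiplier determines U_sm.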
In an m-th order recurrence relation m multiplications and (m-1)
additions are to be performed. This implies $l_o=m$ to be the
optimal output latency in the case of one multiplier and one
adder, thus yielding a speedup of

$$S = l_s / l_o = k/m .$$


Equivalent Transformation of Recurrence Relations 


In order to achieve the desired speedup, the recurrence relation
has to be transformed in such a way that the computation of a
$u_i$ does not depend on the immediate predecessor. For
improving $l_s$,

$$l_g = l_s(1-U_{sm}) \qquad (6)$$

cycles can be gained per interval by 'backstepping' as we will
call it. The resulting recurrence relation should be of the type

$$v_i - c_{m-1}v_{i-\alpha} - c_{m-2}v_{i-\beta} - \cdots - c_0 v_{i-\omega} = 0 \qquad (7)$$

with $\alpha,\beta,\ldots,\omega$ natural numbers and the
property $l \le \alpha < \beta < \ldots < \omega$, where

$$l = 1 + l_b = 1 + \lceil l_g/m \rceil \qquad (8)$$

with $l_b = \lceil l_g/m \rceil$ the number of 'backsteps'.
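
Equations (3), (6) and (8) can be evaluated together for a given machine. The following sketch assumes exact rational arithmetic and uses a hypothetical function name; it is meant only to make the arithmetic of the running example explicit.

```python
from fractions import Fraction
from math import ceil


def backstep_parameters(l_s, u_sm, m):
    """Return (l_o, l_g, l_b, l) for an m-th order recurrence.

    l_o: optimized latency (eq. 3), l_g: cycles gained (eq. 6),
    l_b: number of backsteps, l = 1 + l_b (eq. 8).
    """
    l_o = l_s * u_sm
    l_g = l_s * (1 - u_sm)
    l_b = ceil(l_g / m)
    return l_o, l_g, l_b, 1 + l_b
```

With l_s=5, U_sm=2/5 and m=2, the call `backstep_parameters(5, Fraction(2, 5), 2)` gives l_o=2, l_g=3, l_b=2 and l=3.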

The new recurrence relation is called an equivalent
transformation if $u_i=v_i$ for all i. Since the transformed
recurrence relation is of higher order, more initial values are
needed. They are to be computed from the original recurrence
relation.

One method for obtaining an equivalent transformation would be
repeated substitution. Consider the second order linear
recurrence relation (eq.(5)), which also can be written as

$$u_{i-1} = a_1 u_{i-2} + a_0 u_{i-3} \qquad (9)$$

By substituting $u_{i-1}$ in eq.(5) by eq.(9), $u_i$ can be
computed without depending upon $u_{i-1}$, which results in

$$u_i = (a_0 + a_1^2)\,u_{i-2} + a_0 a_1 u_{i-3} \qquad (10)$$

thus achieving a backstep of one. By further substitution the
desired $l_b$ backsteps may be achieved in this fashion. However,
this procedure is rather awkward and may cause stability
problems, as will be shown later.

Another method for obtaining an equivalent transformation uses
some results regarding conditional recurrence sequences [7]. In
a conditional recurrence sequence the values $u_i$ are obtained
by a set of different recurrence relations, from which the
recurrence relation for calculating the next $u_i$ is chosen
upon a certain condition. Consider for example a sequence with
initial values $u_0$ and $u_1$ where the values are generated by
the alternate application of three different recurrence
relations


Uy -a4Uy_4-aQly _o = 0 for 0 = (i mod 3) 
Uy Uz _1-aQUy_2 = 0 for 1 = (i mod 3) (11) 
Uy -84U4_1-aguy_2 =0: for 2°] (i mod: 3) 


A shift-operator z is defined as

$$z u_i = u_{i-1} \qquad (12)$$

and in general

$$z^j u_i = z(z^{j-1} u_i) = u_{i-j}$$


With this shift operator eq.(11) can be written as 


(1eayqZ-agz?)uy = f(z)u; = 0 for O=(i mod 3) 
(1-ayz-agz2)uy = g(z)u, = 0 for 1=(i mod 3) (13) 
(1-aqz-agz*)uy = nC z)uy. S.0. for 2=Ci mod 3) 


where f(z), g(z), h(z) are polynomials applied to $u_i$.

In general a conditional recurrence sequence is generated by
means of a set P={f(z),g(z),...} of r m-th order polynomials and
a decision function Q(i) by which a polynomial is selected from
P for application to $u_i$. In our example we have r=3 and
Q(i)=(i mod 3).

The theory describes a method which constructs from the r
polynomials of set P a single polynomial F(z), which is a
polynomial in $z^r$. When F(z) is applied unconditionally to
$u_i$ the same recurrence sequence is generated.

The procedure is demonstrated for the polynomials of eq.(13).
First f(z), g(z), h(z) each are split into polynomials of powers
of $z^r$, in our example $z^3$, where

$$f(z) = f_0(z^3) + z f_1(z^3) + z^2 f_2(z^3)$$

with

$$f_0(z^3) = 1 , \quad f_1(z^3) = -a_1 , \quad f_2(z^3) = -a_0$$

for f(z) as it is given in eq.(13); g(z) and h(z) are treated
correspondingly. Eq.(13) can then be written as

$$\begin{aligned}
[f_0(z^3) + z f_1(z^3) + z^2 f_2(z^3)]\,u_{3j} &= 0 \\
[g_0(z^3) + z g_1(z^3) + z^2 g_2(z^3)]\,u_{3j+1} &= 0 \\
[h_0(z^3) + z h_1(z^3) + z^2 h_2(z^3)]\,u_{3j+2} &= 0
\end{aligned} \qquad (14)$$
with j=0,1,...,n/3. Using the shift operator (eq.(12)), the
system (14) may then be written in matrix form

$$\begin{bmatrix}
f_0(z^3) & z^3 f_2(z^3) & z^3 f_1(z^3) \\
g_1(z^3) & g_0(z^3) & z^3 g_2(z^3) \\
h_2(z^3) & h_1(z^3) & h_0(z^3)
\end{bmatrix}
\begin{bmatrix} u_{3j} \\ u_{3j+1} \\ u_{3j+2} \end{bmatrix} = 0 . \qquad (15)$$

It is proved in [7] that the same recurrence sequence is
obtained if eq.(15) is written as

$$F(z)\,u_i = 0$$

with F(z) the value of the determinant

$$F(z) = \begin{vmatrix}
f_0(z^3) & z^3 f_2(z^3) & z^3 f_1(z^3) \\
g_1(z^3) & g_0(z^3) & z^3 g_2(z^3) \\
h_2(z^3) & h_1(z^3) & h_0(z^3)
\end{vmatrix} \qquad (16)$$

Multiplication of the j-th row by $z^j$ and division of the j-th
column by $z^j$ does not change the value of the determinant
(16) and finally

$$F(z) = \begin{vmatrix}
f_0(z^3) & z^2 f_2(z^3) & z f_1(z^3) \\
z g_1(z^3) & g_0(z^3) & z^2 g_2(z^3) \\
z^2 h_2(z^3) & z h_1(z^3) & h_0(z^3)
\end{vmatrix}$$

is produced, which is a polynomial in $z^3$.

This theory also holds for the special case that all polynomials
are identical, i.e. f(z)=g(z)=h(z); an important notion for our
purpose. Recall that in order to achieve an optimal speedup at
least $l_b$ backsteps are required. Therefore all we have to do
is to replicate the polynomial f(z) $l_b+1=l$ times (eq.(8))
into the set P={f(z),f(z),...,f(z)}, thus artificially creating
a 'conditional' recurrence sequence. It will become
unconditional again after the transformation has been carried
through. The resulting polynomial $F(z^l)$ applied to the
original m-th order recurrence relation (eq.(4)) will yield

$$u_i - c_{m-1}u_{i-l} - c_{m-2}u_{i-2l} - \cdots - c_0 u_{i-ml} = 0 . \qquad (17)$$
As an example take a look at eq.(5), whose sequential
calculation is depicted in figure 1. With $U_{sm}=2/5$ a speedup
of S=5/2 should be gained. With eq.(6) and eq.(8) we find
$l_b=2$ and consequently $l=3$. Therefore the set
P={f(z),f(z),f(z)} with

$$f(z) = 1 - a_1 z - a_0 z^2$$

will be used to achieve

$$\begin{vmatrix}
1 & -z^2 a_0 & -z a_1 \\
-z a_1 & 1 & -z^2 a_0 \\
-z^2 a_0 & -z a_1 & 1
\end{vmatrix} = 1 - c_1 z^3 - c_0 z^6$$

with $c_1 = 3a_0a_1 + a_1^3$ and $c_0 = a_0^3$. The equivalent
recurrence relation to (10) will then be

$$u_i = c_1 u_{i-3} + c_0 u_{i-6} .$$

The optimized case is illustrated in figure 2 with an optimal
output latency of $l_o=2$.
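
The equivalence of the transformed second order recurrence can be checked numerically. The sketch below is only an illustration: it uses exact rational arithmetic, takes the six initial values of the transformed relation from the original recurrence as the text prescribes, and uses hypothetical function names.

```python
from fractions import Fraction


def original_sequence(a1, a0, u0, u1, n):
    """u_i = a1*u_{i-1} + a0*u_{i-2}, eq.(5)."""
    u = [u0, u1]
    for i in range(2, n):
        u.append(a1 * u[i - 1] + a0 * u[i - 2])
    return u


def transformed_sequence(a1, a0, initial, n):
    """u_i = c1*u_{i-3} + c0*u_{i-6} with c1 = 3*a0*a1 + a1**3, c0 = a0**3."""
    c1 = 3 * a0 * a1 + a1 ** 3
    c0 = a0 ** 3
    v = list(initial)            # six values computed by the original relation
    for i in range(6, n):
        v.append(c1 * v[i - 3] + c0 * v[i - 6])
    return v
```

Since no term of the transformed relation is closer than three steps back, three consecutive values can be in the pipeline at the same time.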


Some Stability Considerations 


Two criteria for stability of linear homogeneous recurrence
relations are [8]:

- absolute stability, if for all roots $t_i$ of the
  characteristic polynomial
  $$|t_i| < 1 \quad \text{for } i = 1,2,\ldots,m$$
- relative stability, if
  $$t_d = t_{max}$$
  with
  $$t_{max} = \max\{|t_i|\} \quad \text{for } i = 1,2,\ldots,m$$
  the root with maximal absolute value and
  $$t_d = \max\{|t_i| ,\ c_i \neq 0\} \quad \text{for } i = 1,2,\ldots,m$$
  the dominant root, i.e. the maximal absolute root with a
  coefficient $c_i \neq 0$ in the general solution.

We will assume that the original recurrence 
relation is stable. However by its transformation 
instability might be introduced, because the 
transformed relation is of higher order with addi- 
tional roots in the general solution. 

An example will demonstrate what might happen by using the
substitution procedure (eq.(10)), and it also will be shown that
the transformation described above does not introduce
instability.

Consider eq.(5) with $a_1=7/6$ and $a_0=-1/3$, which is
absolutely stable, since the characteristic polynomial

$$t^2 - \frac{7}{6}t + \frac{1}{3} = (t - \frac{1}{2})(t - \frac{2}{3})$$

has $t_{max}=2/3$ as the maximal root. By the substitution
procedure the recurrence relation will become

$$u_i = \frac{37}{36}u_{i-2} - \frac{7}{18}u_{i-3}$$

with the characteristic polynomial

$$t^3 - \frac{37}{36}t + \frac{7}{18} = (t - \frac{1}{2})(t - \frac{2}{3})(t + \frac{7}{6})$$

which no longer is absolutely stable, since $t_{max}=7/6>1$ does
not fulfil the stability criterion. Note that the dominant root
is $t_d=2/3$ since the same sequence is to be generated.
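
The roots quoted in this example can be verified directly. The following sketch assumes the coefficients a_1=7/6 and a_0=-1/3 of the example; it checks the three roots of the substituted relation exactly and, for comparison, builds the six roots of the transformed characteristic polynomial as cube roots of (1/2)^3 and (2/3)^3. All names are hypothetical.

```python
from fractions import Fraction
import cmath

a1, a0 = Fraction(7, 6), Fraction(-1, 3)


def substituted_poly(t):
    """Characteristic polynomial of eq.(10): t^3 - (a0 + a1^2) t - a0 a1."""
    return t ** 3 - (a0 + a1 ** 2) * t - a0 * a1


roots_substituted = [Fraction(1, 2), Fraction(2, 3), Fraction(-7, 6)]

c1 = float(3 * a0 * a1 + a1 ** 3)
c0 = float(a0 ** 3)


def transformed_poly(t):
    """Characteristic polynomial of the transformed relation: t^6 - c1 t^3 - c0."""
    return t ** 6 - c1 * t ** 3 - c0


# The cube roots of each y_r = t_r^3 share the modulus of the original root t_r.
roots_transformed = [
    r * cmath.exp(2j * cmath.pi * j / 3) for r in (0.5, 2 / 3) for j in range(3)
]
max_abs_substituted = max(abs(t) for t in roots_substituted)
max_abs_transformed = max(abs(t) for t in roots_transformed)
```

The substituted relation acquires the extra root -7/6 outside the unit circle, whereas every root of the transformed relation keeps a modulus of 1/2 or 2/3.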

Applying the transformation as discussed above, any m-th order
linear recurrence relation with the characteristic polynomial

$$t^m - a_{m-1}t^{m-1} - a_{m-2}t^{m-2} - \cdots - a_0 = (t-t_1)(t-t_2)\cdots(t-t_m) \qquad (18)$$

will be transformed to a recurrence relation of the type (17),
with the characteristic polynomial

$$t^{ml} - c_{m-1}t^{(m-1)l} - c_{m-2}t^{(m-2)l} - \cdots - c_0 . \qquad (19)$$

Substituting $t^l$ by y the polynomial (19) can be written as

$$\begin{aligned}
y^m - c_{m-1}y^{m-1} - c_{m-2}y^{m-2} - \cdots - c_0
 &= (y-y_1)(y-y_2)\cdots(y-y_m) \\
 &= (t^l-y_1)(t^l-y_2)\cdots(t^l-y_m) \\
 &= (t-t_{11})(t-t_{12})\cdots(t-t_{1l})(t-t_{21})\cdots(t-t_{ml}) .
\end{aligned} \qquad (20)$$

For each of the roots $y_r$, r=1,2,...,m we obtain by
factorization

$$t_{rj} = \sqrt[l]{|y_r|}\; e^{i(2\pi j + \varphi_r)/l}$$

with j=1,2,...,l. Since the new recurrence relation is
equivalent to the original recurrence relation, the roots
$t_1,t_2,\ldots,t_m$ of the original polynomial (eq.(18)) are
included in the set of roots $t_{11},t_{12},\ldots,t_{ml}$ and
the l different roots $t_{rj}$ for each r differ only in the
angle, while the absolute values are the same. Therefore it is
assured that the equivalent transformation of a recurrence
relation as described above does not change the stability
behaviour of the original recurrence relation.


Conclusion 


Since the speedup potential of a recurrence 
relation can be predicted for a given machine 
architecture, the transformation presented can be 
performed automatically. In particular low order
recurrence relations, which are the most common
ones, can be speeded up considerably.


figure 1. sequential computation of a second order linear recurrence

figure 2. overlapped computation of a second order linear recurrence


Abstract 
Long lines on a chip cause problems. There 
are unavoidable performance degradations. For ex- 


ample, capacitance either causes delays, or neces- 
sitates space and power consuming drivers. More- 
over, inexpensive chips are essentially 2-dimen- 
sional, so there simply is no space for long con- 
trol lines, especially if they must overlap. One 
alternative is to allow instructions to move in 
step with the operands. This necessitates a spec- 
ial system controller, and extra instruction buf- 
fers. The advantage is high throughput for pro- 
grams which can be reduced to straightforward pro- 
cedures. Floating point addition, e.g., is a pro- 
cedure which requires up to 2N segments in a pipe- 
line, N being the number of bits in the fractional 
part. In comparison, signed floating point multiplication
(or division) uses only about N segments. 
CORDIC, the most complex routine considered here, 
needs roughly 16N segments. The execution of data- 
stationary instructions has been simulated at the 
bit level using Fortran. Such simulations help 
establish the necessary size and shape of the chip 
surface to execute given instructions, and provide 
detail for design. Overall, the computer archi- 
tecture presented below should be relatively easy 
to implement. 


I. Introduction 


 Data-stationary instructions in a pipeline 
have the major advantage of being natural to VLSI, 
which needs maximum local communications [1,2]. 
Architectures based on fan out to parallel proces- 
sors usually cannot be used if the components have 
to be hardwired. Hardwiring implies the use of 3- 
dimensions, while chips are largely 2-dimensional. 
Examples of 3-dimensional systems, and other types 
of parallel processors may be found in a recent 
excellent survey [3]. 


It is awkward to program a pipeline for certain instructions,
e.g., conditional branches. Nevertheless, general purpose
programming clearly is possible if certain design features are
used [4]. The features include universal programmable segments,
variable numbers of segments depending on the instruction, and
the possibility of (non-local) communication between segments
and main memory. Kogge focused on data-stationary instructions,
showing that they are easier to microprogram and to optimize,
that complex time-stationary programs become shorter, and that
pipelines naturally fill and empty [5,6].


A new computer system based on data-stationary instructions
hopefully would be able to run conventional code; this prevents
existing software from being outdated. Moreover, the resultant
system has to be efficient in array processing, this being a
bottleneck in present inexpensive computers.


Figure 1.
An Architecture for Data-stationary Instructions
(blocks: Memory, Memory Manager, Instruction Buffers,
Floating Point Processor Segments)


II. System Considerations 


Figure 1 shows a typical design. Instruction 
codes shift left as they execute, and flow down 
along with the operands in the processor segments. 
Each instruction is allowed to use a variable num- 
ber of processor segments to complete execution; 
data in the pipeline is not latched until each 
segment is ready. Preliminary questions to be an- 
swered are, what is a convenient value for L, the 
number of processor segments, and how many words 
of memory are needed for the instruction buffers? 
The shape of the necessary area for instruction 
buffers depends on the type of instructions being 
executed. Path 1 in Figure 1 is the endpoint of a row of
instructions which execute at roughly a uniform rate,
instructions such as add, subtract, multiply, and divide. Array
operations such as inner product also use a buffer area roughly
the shape of an upside-down right triangle. If the number of
words in the base of the triangle is specified as W, the
instruction buffers use up to about WL/2 words.


Longer procedures such as CORDIC prevent a uniform flow of
instructions: CORDIC followed by shorter instructions may cause
the last instruction in a row to follow Path 2. Path 2 breaks as
soon as the CORDIC instruction is done. Shorter instructions
followed by CORDIC may cause the CORDIC instruction to follow
Path 3, or Path 4. A stream of instructions which follow Path 4
would free the vast majority of buffers.


The Memory Manager (MM) in Figure 1 must keep 
track of the types of instructions which are load- 
ed. Overflow beyond L segments should be prevent- 
ed; otherwise recycling will be necessary. In- 
struction recycling is undesirable, since unlucky 
instructions may be waiting for operands from main 
memory which must be generated by the recycled in- 
structions. Note that each processor segment is 
connected to main memory via a bus. 


Reading or writing in the system of Figure 1 may involve timing
difficulties when sequential code is being run. Two kinds of
instruction must not execute ahead of time: a) a load from a
register in memory which has not yet received the proper data,
because a store to that register has not yet executed; b) a
store to a register whose data has not yet been used, because a
load from that register has not yet executed. Clearly, the
nature of the code is important, since it may avoid using the
same addresses for temporary storage. Data flow concepts
apparently reduce the need for addressable memory; but languages
and compilers are needed to support the data flow concept [7].


One function of the MM is to check instructions
and addresses for those which must not be 
loaded into the instruction buffers due to poor 
timing. The MM may instead insert NO OPS until 
the next instruction can be loaded, or it may in- 
terrupt the program and run another. An interrupt 
involves storing recovery information in a stack. 
Interrupts are efficient when many structured sub- 
programs are waiting to be run. It may happen 
that the programming is such that mainly the first 
processor in the pipeline is active, the others 
being idle. This is, in effect, an automatic re- 
version to a single processor computer, something 
which is considered acceptable by the author. As 
long as arrays run efficiently, a typical program 
would execute at a fairly steady rate from the 
user's point of view. Again we stress that faster 
execution occurs via better coding. 
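
The checking role of the MM can be sketched in a few lines. This is a rough illustration under stated assumptions, not the controller of the paper: every instruction is assumed to stay in the pipeline for a fixed number of cycles, only the two conflicts listed above (load after unexecuted store, store after unexecuted load) are tested, and the names `Instr`, `must_wait` and `schedule` are hypothetical.

```python
from collections import namedtuple

# op is "load", "store" or any other mnemonic; addr is a memory address or None.
Instr = namedtuple("Instr", "op addr")


def must_wait(new, in_flight):
    """True if `new` would execute ahead of a conflicting load or store."""
    for old in in_flight:
        if old.addr is None or old.addr != new.addr:
            continue
        if new.op == "load" and old.op == "store":    # case a)
            return True
        if new.op == "store" and old.op == "load":    # case b)
            return True
    return False


def schedule(program, latency):
    """Issue one slot per cycle, inserting "NOOP" while a conflict exists."""
    issued, in_flight = [], []        # in_flight holds [instr, cycles_left]
    pending = list(program)
    while pending:
        nxt = pending[0]
        if must_wait(nxt, [instr for instr, _ in in_flight]):
            issued.append("NOOP")
        else:
            issued.append(nxt)
            in_flight.append([nxt, latency])
            pending.pop(0)
        for entry in in_flight:       # advance the pipeline by one cycle
            entry[1] -= 1
        in_flight = [entry for entry in in_flight if entry[1] > 0]
    return issued
```

A store to address 5 followed directly by a load from address 5 yields one inserted NOOP when the assumed latency is two cycles; a load from an unrelated address is issued without delay.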


The MM also identifies and regulates bran- 
ches, e.g., the IF ( ) THEN ( ) ELSE( ). There 
are 2 options: (a) interrupt, running something 
else until the condition which determines branch 
direction is calculated. (b) run as much code as 
possible in both branches, deleting the unneeded 
results after the branch condition is calculated. 
Code can minimize conditional branches, e.g., by 
using fixed numbers of iterations in loops. The 
number of instructions in a loop may be used to 
determine the method of implementation. Loops 
with only a few instructions and a fixed number of iterations
may be loaded by the MM as one long sequence of code. Longer
loops can also be straightened for the pipeline, but may need a
way to exit under given conditions. The MM can detect the
condition and stop feeding replicated code. Hence, a
LOOP UNTIL ( ) is possible using the STATUS line in Figure 1.


III. The Matrix Processing Feature 


Matrix processing is implemented in a quite 
ordinary way; vector operands and partial products 
move in the same direction; array operands go 
diagonally. In comparison, Kung's systolic processing may have
operands and partial products going in opposite directions;
array operands move in at right angles [1, 8]. The instruction
buffers may be loaded with data, or addresses to be 
processed in parallel. The options are: A) data 
can be put immediately into the buffers; B) ad- 
dresses can be put into the buffers in any order 
for direct or extended addressing; C) the addres- 
ses can be stepped one unit at a time from an in- 
dex placed in the buffer. A useful feature used 
in array processing is the End of Data (EOD) mark 
in the data stream to bring the system out of the 
array processing mode. 


Figure 2 illustrates the data flow for a
simple inner product (c = ab).


Figure 2.
The Data Flow for an Inner Product (c = ab)


Each block for c represents roughly N segments, within which it
is possible to execute a floating point multiply-add operation.
Notice that processor usage is complete, and that the data
buffer space really has a triangular shape. Values are shifted
left and down. The operands, e.g., a_i, b_i, must be stacked
together as shown.


In a 1-dimensional pipeline, Matrix-vector 
multiplication (c = Ab) involves keeping the b 
vector in the first row until the matrix is fully 
loaded (see Figure 3). Although processor usage is 
complete, portions of b are repeated in the buffer 
space. The important thing is that a version of array
processing indeed meshes with data-stationary instruction
processing. If 2-dimensional pipelines were allowed, 1 layer
would suffice to load a matrix-matrix multiplication (C=AB), as
illustrated in Figure 4.


A,,, for example, is transmitted to the right 
column of C, i.e., C Bil is 


transmitted to the first row of C, i.e., Ciy> Co1> 
C3, and Cy4,;. The 4 x 4 matrix-matrix multiplicat- 


ion is completed after 4 time frames (not all 
shown). Note in the figure that arrays are in standard form
when viewed from the bottom. The processors are used completely.
Each layer does a matrix-matrix multiplication, so a sequence of
layers could do certain tensor operations. The problem is that
today's VLSI is mostly 2-dimensional. When 3-dimensional
technologies become popular, 3-dimensional pipelines may
conveniently perform sophisticated iterations, recursions, and
picture reconstructions. Meanwhile, this paper will be concerned
with only 2 dimensions.


Figure 3.
The Data Flow For a Matrix-Vector Multiplication
(c = Ab)


a 2-dimensional Pipeline


IV. The Design of a Segment 


A standard method starts with a list of desired machine
instructions, and ends with a timed logic circuit [9]. Standard
methods are systematic, but iterative. Especially interesting is
the microprogramming for data-stationary instructions; the
microprogram must proceed in space and time from segment to
segment along with the instruction. Each segment is self-timed,
and is organized for floating point (FLP) arithmetic as is
usually required in a general computer.


The Control Logic 


Unclocked combinational logic is located be- 
tween the latches of the pipeline as in Figure 5. 


Figure 5.
The General Plan for a Processing Segment
(blocks: Data Latch i, Instruction Buffer i, Memory,
Control Logic, Clock, Data Latch i+1, Instruction Buffer i+1)


Figure 6 shows the overall structure for the circuitry between
the instruction buffers. This circuitry serves to shift, and to
transfer instructions or numerical data. An instruction which
has been shifted into the leftmost part of the buffer will
activate control lines and begin to execute. The signals in the
control lines are sequenced by counts stored in the registers
Seq0, Seq1, Seq2. These counters can be incremented (or set to
0) by INC0, INC1, and INC2. The control circuits are nested
finite state machines. Future 
counts occur in subsequent segments. The PLAs de- 
code the instruction and the count; microprograms 
in the PLAs must be replicated for each segment. 
The Address Bus Control places addresses on the 
address bus to memory (dashed line) for reading or 
writing data. The address field associated with 
an instruction is assumed to vary from 0 to 8 
bytes; the instruction itself is allowed 1 byte. 
The Shift Left block in Figure 6 may need to shift 
up to 9 bytes. There is an option in the Shift 
Left block to shift addresses or data, but not the 
instruction. This option is used for inner pro- 
ducts. 
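
The Shift Left block described above can be pictured
with a minimal Python sketch. It is an illustration
under stated assumptions, not the circuit of Figure 6:
the function name shift_left, the flag
keep_instruction, and the zero fill on the right are
all hypothetical. Byte 0 is taken to be the leftmost
position, holding the 1-byte instruction.

```python
def shift_left(buffer: bytes, count: int, keep_instruction: bool = False) -> bytes:
    """Shift an instruction buffer left by `count` bytes, zero-filling on the right.

    With keep_instruction=True the instruction byte in position 0 stays in
    place and only the address/data bytes behind it are shifted (assumed
    here to model the option used for inner products).
    """
    if not 0 <= count <= 9:
        raise ValueError("the text allows shifts of at most 9 bytes")
    if keep_instruction:
        head, tail = buffer[:1], buffer[1:]
    else:
        head, tail = b"", buffer
    # Drop `count` bytes from the left of the tail and pad with zeros.
    shifted = tail[count:] + bytes(min(count, len(tail)))
    return head + shifted
```

Shifting by 1 + k bytes after an instruction with a
k-byte address field brings the next instruction to
the leftmost position; with keep_instruction set, the
same instruction stays put while fresh operands move
up behind it.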


The MM prevents reading and writing which 
would result in numerical error in sequential 
code, as mentioned above. Permissible load and 
store commands may have to wait a short time to 
use the bus. The pipeline is held back until each 
load and store is accomplished; an inhibit line is 
included in the data bus. 


Load/Save logic in each segment generates 
read and write signals, and helps manage the bus
traffic. Outputs waiting to be saved are serviced
before input data is read, since outputs depend on 
the values in the input latches. The writes to 
memory are done one at a time, each processor hav- 
ing a status which the central memory manager can 
poll. 


Inputs to be read from memory may also have 
to wait a short time to use the bus. Reads can be 
implemented by controlling the clock to the segment.
Segments with a load command usually would
latch the input data which is read. Microcode has 
to allow for this method of reading memory. The 
reads are done one at a time, after the writes to 
memory. The bus management is fairly standard. 
The Conditional Branch Logic causes startup addres- 
ses to be moved from the interrupt stack kept by 
the MM. 
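
The ordering rule described above (writes serviced
one at a time before any reads) can be sketched as
follows. This is only an illustration: the function
name bus_schedule, the request format, and the choice
to poll segments in index order are assumptions, since
the text does not give the polling order.

```python
def bus_schedule(pending):
    """Return the order in which pending bus requests would be serviced.

    `pending` is a list of (segment_index, kind) pairs, where kind is
    "write" or "read". All writes go first, then all reads; within each
    group, segments are taken in index order (an assumed polling order).
    """
    writes = sorted(seg for seg, kind in pending if kind == "write")
    reads = sorted(seg for seg, kind in pending if kind == "read")
    return [(seg, "write") for seg in writes] + [(seg, "read") for seg in reads]
```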


The Floating Point Processors 


The FLP processor segments have to be simple
adders with a few gates to direct data. The processors
should be simpler than most currently
available microprocessors since there is a need to
save space. Algorithms which use adding and shifting
are easily pipelined [10-15]. Iterations
are accomplished in successive segments rather 
than by looping. Figure 7 shows a typical design. 
Fewer components might give perhaps only 1 gate 
delay per segment for higher throughput, but would 
increase the complexity of the microprogramming. 
Referring to Figure 7, the fractional parts of 
numbers are kept in latches A and B in signed 2's
complement form; exponents of the same form are
in a and b. B' is for storage of operands, counts, 
and constants; b} receives the rightmost bit in Bg 
after a shift right. Various status bits are de- 
fined in the figure. 


The Microprogramming 


The machine instructions may be broken into 
microinstructions complete with sequencing information.
Unfortunately, the microprograms cannot be
published here due to limited space, so only their
overall features are summarized. Floating Point
Addition (FLP ADD) takes 7 lines of microprogramming
(not shown). The align-exponents section may take up
to N segments, and likewise postnormalization,
where N is the precision. The
total is about 2N segments for FLP ADD. Further 
information about floating point systems may be 
found in Reference 10, Chapter 9. 
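
The three phases of FLP ADD can be sketched in Python
as below. This is not the unpublished microprogram; it
is a minimal sketch assuming fractions are signed
integers that fit in n bits, that each single-bit
shift corresponds to the work of one segment, and that
right shifts simply truncate. The name flp_add and the
returned step count are hypothetical.

```python
N = 8  # fraction width in bits (the precision used in the paper's summary)


def flp_add(fa, ea, fb, eb, n=N):
    """Add fa * 2**ea and fb * 2**eb.

    fa and fb are signed fractions held as integers in the n-bit two's
    complement range. Returns (fraction, exponent, steps), where steps
    counts single-bit shifts plus the one addition.
    """
    low, high = -(1 << (n - 1)), (1 << (n - 1)) - 1
    assert low <= fa <= high and low <= fb <= high
    steps = 0

    # 1) Align exponents: shift the operand with the smaller exponent
    #    right, one bit per step. After n shifts nothing more can change.
    exp = max(ea, eb)
    for _ in range(min(abs(ea - eb), n)):
        if ea < eb:
            fa >>= 1
        else:
            fb >>= 1
        steps += 1

    # 2) Add the aligned fractions.
    total = fa + fb
    steps += 1

    # 3) Postnormalize: one right shift on overflow, otherwise left
    #    shifts until the top two bits of the fraction differ.
    if not low <= total <= high:
        total >>= 1
        exp += 1
        steps += 1
    else:
        quarter = 1 << (n - 2)
        while total != 0 and -quarter <= total < quarter:
            total <<= 1
            exp -= 1
            steps += 1
    return total, exp, steps
```

In this sketch alignment takes at most n steps and
postnormalization fewer than n, which is in line with
the estimate of about 2N segments.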


FLP SUB is very similar; FLP MUL based on 
Booth's pairs needs 6 lines of microprogramming 
(not shown). It needs about N segments. FLP DIV 
based on a modification of Guild's method needs 8 
lines for signed division. About N segments are 
required. 
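
For readers unfamiliar with multiplication by pairs,
the following Python sketch may help. It assumes that
"Booth's pairs" refers to the usual bit-pair (radix-4)
Booth recoding of the multiplier; it works on signed
integers rather than on the A, B latches, and it is
not the 6-line microprogram. How its loop passes map
onto segments is not specified here.

```python
def booth_pairs_multiply(a, b, n=8):
    """Multiply signed n-bit integers a and b by radix-4 Booth recoding.

    The multiplier b is scanned two bits at a time. Each pair, together
    with the bit to its right, selects a multiple of a from
    {-2a, -a, 0, +a, +2a}, so only shifts and adds are needed.
    """
    assert n % 2 == 0, "this sketch needs an even word length"
    low, high = -(1 << (n - 1)), (1 << (n - 1)) - 1
    assert low <= a <= high and low <= b <= high
    product = 0
    prev = 0  # implicit 0 to the right of the least significant bit
    for i in range(0, n, 2):
        b0 = (b >> i) & 1
        b1 = (b >> (i + 1)) & 1
        digit = b0 + prev - 2 * b1  # recoded digit in -2..2
        product += (digit * a) << i
        prev = b1
    return product
```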


FLP CORDIC 


The nested finite state machines permit sever- 
al methods for function evaluation, but CORDIC is 
most interesting as an example [15]. It evaluates 
more than one function at once, if desired; and it 
may be used for plane matrix rotations [16]. To 
simplify, the easily implemented pre-normalizations 
and post-normalizations are omitted from this dis- 
cussion. CORDIC needs 9 lines of microprogramming, 
and also parts of the FLP ADD routine. Each addition
is assumed to take 1 segment to prenormalize,
1 to add (or subtract), and 1 to postnormalize. N
bits of precision are generated in N mathematical 
iterations; it is estimated that 16N segments are 
needed in a CORDIC evaluation.
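
As an illustration of the iteration being pipelined,
here is a minimal Python sketch of rotation-mode
CORDIC, which yields cosine and sine together. It is
the textbook form of the algorithm, not the 9-line
microprogram: floating point stands in for the
fixed-point latches, multiplying by 2**-i stands in
for a right shift by i bits, and the name
cordic_sin_cos is hypothetical.

```python
import math


def cordic_sin_cos(theta, n=16):
    """Return (cos(theta), sin(theta)) after n CORDIC iterations.

    Valid for |theta| <= pi/2. Each iteration updates the three
    variables x, y, z using only shifts and add/subtracts, plus one
    constant atan(2**-i) from a table.
    """
    angles = [math.atan(2.0 ** -i) for i in range(n)]
    gain = 1.0
    for i in range(n):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    # Start with x pre-scaled so the result needs no final correction.
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(n):
        d = 1.0 if z >= 0 else -1.0  # rotate toward z = 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x, y
```

Each pass of the loop performs three add/subtract
operations, one for each of x, y, and z, and roughly
one more bit of the result becomes correct per pass.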


FLP IP (Inner Product) 


FLP IP is little more than FLP MUL; register 
B,b accumulates the inner product. The ITERO 
command keeps adding partial products until an EOD 
mark in the data stream is detected by circuitry. 
The DONE is then enabled, feeding the next in- 
struction, with the option to store the inner 
product in memory. Three lines of microprogram- 
ming are needed, and also parts of the FLP MUL 
routine.
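
The accumulate-until-EOD behaviour can be sketched in
a few lines of Python. The sentinel object EOD, the
function name inner_product, and the representation of
the data stream as pairs of operands are assumptions
made for illustration only.

```python
EOD = object()  # hypothetical stand-in for the end-of-data mark


def inner_product(stream):
    """Accumulate x * y over a stream of (x, y) pairs until EOD is seen."""
    acc = 0  # plays the role of the accumulating register
    for item in stream:
        if item is EOD:
            return acc  # corresponds to DONE being enabled
        x, y = item
        acc += x * y
    raise ValueError("stream ended without an EOD mark")
```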


V. Fortran Simulation 


Ordinary Fortran using integer arrays filled 
with ones and zeros is useful for digital simula- 
tions. The author's simulations used a 'PLA' sub- 
routine to set control lines, and a main 'SEGSIM' 
program to call for various register transfers, 
e.g., APB, which adds B to A in binary. The 
reasoning behind the above microprogramming was 
thus verified. Simulations at the bit level bring 
one closer to the necessary detail for VLSI design 
where trial and error methods can be extremely ex- 
pensive. 
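
The flavor of such a bit-level simulation can be shown
with a short sketch. It is written in Python rather
than Fortran, and it is not the author's SEGSIM code:
the function name apb is borrowed from the register
transfer named above, while the bit ordering (most
significant bit first) and the discarded carry out are
assumptions.

```python
def apb(a, b):
    """Return A plus B for two equal-length lists of ones and zeros.

    Bits are stored most significant first. The carry out of the top
    bit is discarded, as in fixed-width two's complement addition.
    """
    assert len(a) == len(b)
    result = [0] * len(a)
    carry = 0
    for i in range(len(a) - 1, -1, -1):  # ripple from the low-order bit
        total = a[i] + b[i] + carry
        result[i] = total & 1
        carry = total >> 1
    return result
```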


VI. Summary and Conclusions 


A designer has several particular trade-offs: 
1) Only one bus is proposed to conserve space and 
to minimize non-local communications within the 
chip. An alternative is to load all addresses and 
data into the head of the pipeline, and to receive 
information only from the foot, where results are
written to memory. This would eliminate non-local
communications and may be a good thing in some 
applications. It is not clear that conventional 
sequential code would run well without a bus, par- 
ticularly when conditional branching is necessary. 
2) Conditional branches are handled by waiting un- 
til the branch condition is decided before loading 
more code. The option is to load and run as much 
code as possible, selecting the desired results 
after the branch condition is calculated. Running 
more than one branch direction would complicate 
the Conditional Branch logic. 3) Feedback in the 
pipeline is avoided to reduce non-local connections 
on the chip. Simple accumulators and more complex 
pipelines with feedback are not used in this arch- 
itecture, although feedback has certain advantages 
[5]. 4) Loops are implemented by loading the code 
within the loop repeatedly as it is being run. 
Loops are straightened; true loops are not used in 
this design. The option is to use feedback and 
microprogram loops; unfortunately, the pipeline 
would then be stopped until individual loops are
satisfied. 5) CORDIC is proposed without the help
of extra local registers; this simplifies the circuitry
and the microprogramming, but function evaluation
depends heavily on the memory bus. If
CORDIC had to be called frequently, local registers 
for the 3 variables, and for the constants would be 
well worthwhile. 


The architecture presented for data-stationary 
instructions minimizes long distance communications 
on a chip; the approach is reasonably programmable. 
Assuming a modest floating point system with a precision
(N) equal to 8, a pipeline with L = 128 segments
should be quite useful. VLSI with hundreds
of thousands of gates could support the pipelined
structure; there would even be space for a small 
memory. Interfacing serially, 1 word per clock 
pulse, is within reason, allowing video to be processed
for T.V., for example. Currently, a data-stationary
architecture appears to be a good compromise
for VLSI, and it is relatively easy to
implement.
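
To put the figures N = 8 and L = 128 in perspective,
the segment estimates quoted earlier (about 2N for FLP
ADD, about N for FLP MUL and FLP DIV, and 16N for
CORDIC) can be turned into a rough count of how many
instructions of one kind fit in the pipeline at once.
The sketch below treats the "about" figures as exact
and assumes instructions are packed back to back; the
names are hypothetical.

```python
def instructions_in_flight(pipeline_segments, segments_per_instruction):
    """Whole instructions that fit in the pipeline at the same time."""
    return pipeline_segments // segments_per_instruction


N = 8    # precision in bits
L = 128  # pipeline length in segments

SEGMENTS = {
    "FLP ADD": 2 * N,
    "FLP MUL": N,
    "FLP DIV": N,
    "CORDIC": 16 * N,
}

CAPACITY = {name: instructions_in_flight(L, s) for name, s in SEGMENTS.items()}
```

Under these assumptions a 128-segment pipeline holds 8
additions, 16 multiplications or divisions, or exactly
one CORDIC evaluation.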